Statistical Analysis of Texas Bridges Using Regression Modelling¶

Author: Rashad Malik

Project aim and outline¶

The aim of this project is to conduct a comprehensive analysis of bridges in Texas, US, to understand the predictive power of specific variables on bridge conditions. Through data preparation, exploratory analysis, and regression modelling, the project seeks to determine how factors such as bridge age, average usage, percentage of truck traffic, material, and design impact overall bridge condition.

This analysis explores key predictors of structural health to provide insights that could help guide future maintenance plans and resource allocation. The project also aims to provide transparent methodologies and assumptions to enable informed decision-making.

The notebook contains the following sections:

  • Introduction
    • Importing required libraries, describing the dataset and loading the data
  • Analysis
    • Part 1: Data preparation
    • Part 2: Exploratory analysis
    • Part 3: Regression modelling
  • Summary and conclusion
  • References

Introduction¶

Importing libraries¶

The following libraries are required for our analysis:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, root_mean_squared_error
%matplotlib inline

The dataset¶

Dataset description and variables¶

The data was obtained from the US Department of Transportation's Federal Highway Administration (FHWA), specifically from their National Bridge Inventory database.

For this project, we are using a greatly simplified subset of the data:

  • Our data contains bridge data related to Texas only.
  • Culverts have been excluded from the dataset, leaving us only with bridges for analysis.

Below is a description of the variables within our dataset:

The original FHWA dataset has over 100 variables; ours is simplified. Both continuous and categorical variables are included.

| Variable | Description | Type |
|---|---|---|
| Structure_id | Unique identifier of the bridge | string |
| District | Highway district in Texas responsible for the bridge | category |
| Detour_Km | Length of detour if the bridge is closed | continuous |
| Toll | Whether a toll is paid to use the bridge | category |
| Maintainer | The authority responsible for maintenance | category |
| Urban | Whether the bridge is located in an urban or rural area | category |
| Status | The road class: interstate to local | category |
| Year | The year the bridge was built | continuous |
| Lanes_on | The number of lanes that run over the bridge | continuous (or discrete) |
| Lanes_under | The number of lanes that run under the bridge | continuous (or discrete) |
| AverageDaily | The average daily traffic (number of vehicles) | continuous |
| Future_traffic | The estimated daily traffic in approx. 20 years' time | continuous |
| Trucks_percent | The percentage of traffic made up of 'trucks' (i.e. lorries) | continuous |
| Historic | Whether the bridge is historic | category |
| Service_under | The (most important) service that runs under the bridge | category |
| Material | The dominant material the bridge is made from | category |
| Design | The design of the bridge | category |
| Spans | The number of spans the bridge has | category (or discrete) |
| Length | The length of the bridge in metres | continuous |
| Width | The width of the bridge in metres | continuous |
| Rated_load | The rated maximum loading of the bridge (in tonnes) | continuous |
| Scour_rating | Only for bridges over water: the 'scour' condition | ordinal |
| Deck_rating | The condition of the deck of the bridge | ordinal |
| Superstr_rating | The condition of the bridge superstructure | ordinal |
| Substr_rating | The condition of the bridge substructure (foundations) | ordinal |

Note on 'scour': when a bridge is over (for example) a river, the flow of water in the river can undermine any bridge supports (called 'piers') in the water. This is called 'scouring'. The Scour_rating gives the condition with respect to possible damage from scouring.

Values of categorical variables: In the original data, the values of the categorical variables are represented as integers, with their meanings given in a data dictionary. In our dataset, these 'numeric codes' have been replaced with suitable names.

| Variable | Values |
|---|---|
| District | Each district has a unique number |
| Toll | Toll, Free |
| Maintainer | State, County, Town or City, Agency, Private, Railroad, Toll Authority, Military, Unknown |
| Urban | Urban, Rural |
| Status | Interstate, Arterial, Minor, Local |
| Historic | Register, Possible, Unknown, Not historic |
| Service_under | Other, Highway, Railroad, Pedestrian, Interchange, Building |
| Material | Other, Concrete, Steel, Timber, Masonry |
| Design | Other, Slab, Beam, Frame, Truss, Arch, Suspension, Movable, Tunnel, Culvert, Mixed |
| Scour_rating | Unknown, Critical, Unstable, Stable, Protected, Dry, No waterway |
| Deck_rating | NA, Excellent, Very Good, Good, Satisfactory, Fair, Poor, Serious, Critical, Failing, Failed |
| Superstr_rating | Same rating scale as Deck_rating |
| Substr_rating | Same rating scale as Deck_rating |
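The replacement of numeric codes with names can be sketched as follows. Note that the codes used here are hypothetical, purely for illustration; the real values are defined in the FHWA data dictionary:

```python
import pandas as pd

# Hypothetical numeric codes for the Toll variable -- illustration only,
# NOT the actual FHWA data-dictionary values.
code_to_name = {1: "Toll", 3: "Free"}

raw = pd.Series([1, 3, 3, 1], name="Toll")        # codes as stored in the original data
named = raw.map(code_to_name).astype("category")  # named categories, as in our dataset
print(named.tolist())  # → ['Toll', 'Free', 'Free', 'Toll']
```

The same `map`-then-`astype("category")` pattern would apply to any of the coded columns above.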

Data loading¶

Below, we load the dataset from a CSV file and explicitly define the data types for each variable. By using a "type map", we avoid the default behaviour of treating non-numeric fields as strings, allowing us to define categorical variables accurately.

Categorical variables, such as Structure_id, Toll, and Material, are assigned the category data type. For ordinal variables (those with a meaningful order), we use pd.CategoricalDtype with a specified order. For instance, the Deck_rating, Superstr_rating, and Substr_rating fields are assigned a rating_type with categories ordered from 'Failed' to 'Excellent'. Similarly, Scour_rating is given a scour_type with categories that range from 'Unknown' to 'No Waterway'.

This approach ensures that categorical and ordinal variables are correctly interpreted by the analysis, improving data clarity and consistency. Finally, the dataset is loaded with Structure_id as the index column, which will help us with handling our data.

In [2]:
# Defining the categorical data type for bridge ratings with an explicit order
rating_type = pd.CategoricalDtype(
    categories=["Failed", "Failing", "Critical", "Serious", "Poor", "Fair",
                "Satisfactory", "Good", "Very Good", "Excellent", "NA"],
    ordered=True)

# Defining the categorical data type for scour ratings with an explicit order
scour_type = pd.CategoricalDtype(
    categories=["Unknown", "Critical", "Unstable", "Stable", "Protected", "Dry", "No Waterway"],
    ordered=True)

# Defining a dictionary specifying data types for each column in the dataset
types_dict = {
    "Structure_id": str,
    "Toll": "category",
    "Maintainer": "category",
    "Urban": "category",
    "Status": "category",
    "Historic": "category",
    "Service_under": "category",
    "Material": "category",
    "Design": "category",
    "Deck_rating": rating_type,
    "Superstr_rating": rating_type,
    "Substr_rating": rating_type,
    "Scour_rating": scour_type
}

# Load the dataset with specified data types and set Structure_id as the index column
bridges = pd.read_csv("data/tx19_bridges_sample.csv", dtype=types_dict, index_col="Structure_id")

# Print dataframe information
bridges.info()
<class 'pandas.core.frame.DataFrame'>
Index: 34293 entries, 000021521-00101 to DAPTRABLI000011
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   District         34293 non-null  object  
 1   Detour_Km        34293 non-null  int64   
 2   Toll             34293 non-null  category
 3   Maintainer       34293 non-null  category
 4   Urban            34293 non-null  category
 5   Status           34293 non-null  category
 6   Year             34293 non-null  int64   
 7   Lanes_on         34293 non-null  int64   
 8   Lanes_under      34293 non-null  int64   
 9   AverageDaily     34293 non-null  int64   
 10  Historic         34293 non-null  category
 11  Service_under    34293 non-null  category
 12  Material         34293 non-null  category
 13  Design           34293 non-null  category
 14  Spans            34293 non-null  int64   
 15  Length           34293 non-null  float64 
 16  Width            34293 non-null  float64 
 17  Deck_rating      34288 non-null  category
 18  Superstr_rating  34291 non-null  category
 19  Substr_rating    34293 non-null  category
 20  Rated_load       34293 non-null  float64 
 21  Trucks_percent   34293 non-null  float64 
 22  Scour_rating     24520 non-null  category
 23  Future_traffic   34293 non-null  int64   
dtypes: category(12), float64(4), int64(7), object(1)
memory usage: 4.8+ MB

We can see that the data has been successfully loaded: our dataframe bridges contains 34,293 rows (bridges) and 24 columns (variables).

We can also see some null (missing) values in the Deck_rating and Superstr_rating columns (and a substantial number in Scour_rating). The rating columns will be necessary for our analysis later, so we will eventually need to address this issue.

Analysis¶

Part 1: Data preparation¶

1.1 Deriving the age of bridges¶

As part of the analysis, I needed to find how the age of a bridge may influence its condition. However, the dataset does not contain this information directly, therefore I created a new variable Age which is derived from the Year variable.

In [3]:
# Calculating bridge age from the construction year
bridges["Age"] = 2024 - bridges["Year"]

The new variable Age has now been created; let's check some quick summary statistics for this column.

In [4]:
bridges["Age"].describe()
Out[4]:
count    34293.000000
mean        42.502581
std         23.860135
min          5.000000
25%         22.000000
50%         39.000000
75%         60.000000
max        124.000000
Name: Age, dtype: float64

The summary statistics for bridge age show a large range, with ages spanning from 5 to 124 years. The average age is approximately 42.5 years, and the distribution is fairly spread out, with a standard deviation of 23.86 years, reflecting a wide variety of Texan bridge ages.

1.2 Excluding older bridges¶

Since this analysis focuses on factors influencing the condition of bridges, which in turn could inform bridge maintenance and planning strategies, I needed to decide what to do with older bridges.

The inclusion of older bridges in the analysis may affect the results, as they may have their own specialised maintenance routines. Therefore I trimmed the data to remove some of these bridges. My approach to removing older bridges was the following:

  • Investigate the Historic column, and check if this information could be used for the selection.
  • Explore the age distribution of bridges.

First, let's explore the Historic column. As described on the FHWA's detailed code mapping for individual data items page, this column consists of the following four possible categories:

  • Register: Bridge is on the National Register.
  • Possible: Bridge is eligible for the National Register.
  • Unknown: Historic significance of the bridge has not been determined.
  • Not historic: Bridge is not eligible for the National Register, and is not in a historic district eligible for the National Register.

I constructed a violin plot to visualise the distribution of this variable against the bridge age.

In [5]:
# Violin plot
sns.set_style("whitegrid")
plt.figure(figsize = (7, 4))
sns.violinplot(bridges, x="Historic", y="Age", cut=0, alpha=0.8)

plt.xlabel("Historic status", fontsize = 12, fontweight = "bold", labelpad = 10)
plt.ylabel("Bridge age", fontsize = 12, fontweight = "bold", labelpad = 10)
plt.gca().yaxis.set_minor_locator(plt.MultipleLocator(5))
plt.grid(which="minor", axis="y", linestyle=":", linewidth=0.7)
plt.ylim(bottom=0)

plt.text(-1, 148, 
         "Age distribution of bridges", 
         size = 14, weight = "bold", color = "black")
plt.text(-1, 138, 
         "by historic status", 
         size = 13, color = "black")
plt.text(-1, -38, 
         "RASHAD MALIK" + " " * 42 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()
[Figure: "Age distribution of bridges by historic status" (violin plot)]

From the violin plot, we can observe that registered historic bridges tend to be the oldest, while the non-historic and unknown categories generally include younger structures. However, this is not a guarantee: both the Register and Possible categories also contain bridges at the younger end of the age range.

We can create a table with the ranges to see the minimum and maximum ages within each category:

In [6]:
# Grouping the data by historic category, and calculating minimum and maximum ages
age_stats = bridges.groupby("Historic", observed=False)["Age"].agg(["min", "max"])
age_stats
Out[6]:
              min  max
Historic
Not historic    5   94
Possible        6  124
Register        6  124
Unknown         7   64

Since the Historic variable has such a wide range of ages regardless of category, I decided not to use it for excluding older bridges. Instead, I tried a different approach, analysing bridge ages directly.

First, I visualised the ages of the bridges on a probability density plot, and highlighted the 90th percentile. This shows the oldest 10% of bridges.

In [7]:
# Calculating the 90th percentile of the bridge ages
percentile_90 = bridges["Age"].quantile(0.9)

# Plotting the probability density plot
sns.kdeplot(bridges, x="Age", fill=True)
plt.xlabel("Bridge age", fontsize = 12, fontweight = "bold", labelpad = 10)
plt.ylabel("Density", fontsize = 12, fontweight = "bold", labelpad = 10)
plt.gca().yaxis.set_minor_locator(plt.MultipleLocator(0.0005))
plt.grid(which="minor", axis="y", linestyle=":", linewidth=0.7)
plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(5))
plt.grid(which="minor", axis="x", linestyle=":", linewidth=0.7)

# Adding a vertical line at the 90th percentile
plt.axvline(percentile_90, color="orange", linestyle="-", label=f"90th Percentile ({percentile_90:.2f})")

plt.text(-36.5, 0.02, 
         "Probability density of bridge age", 
         size = 14, weight = "bold", color = "black")
plt.text(-36.5, 0.0188, 
         "with 90th percentile highlighted", 
         size = 13, color = "black")
plt.text(-36.5, -0.0043, 
         "RASHAD MALIK" + " " * 42 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.legend()
plt.show()
[Figure: "Probability density of bridge age, with 90th percentile highlighted"]

Since historic status alone does not determine whether a bridge is unusually old, I decided to exclude bridges above the 90th percentile of age (74 years), as this provides a more data-driven way to filter out very old bridges while retaining a representative mix of ages across all historic categories.

This approach preserves the distribution and diversity of bridge ages without disproportionately removing entries based on historic classification, which may introduce unnecessary bias into the analysis.

In [8]:
# Excluding bridges over the 90th percentile age
bridges_filtered = bridges[bridges["Age"] <= percentile_90]

1.3 Reducing the number of categories for Materials and Design¶

The categorical variables Material and Design will be used in the analysis. However, I needed to investigate if there are categories within these variables that contain a small number of bridges. Below are bar plots of these two variables.

In [9]:
# Plotting bar plots for "Design" and "Material"
fig, axes = plt.subplots(1, 2, figsize=(12.5, 5))
colors = sns.color_palette("muted")

# Bar plot for Material
bridges_filtered["Material"].value_counts().plot(kind="bar", ax=axes[0], color=colors[2], alpha=0.8)
axes[0].set_xlabel("Material", fontsize = 12, fontweight = "bold", labelpad = 10)
axes[0].set_ylabel("Frequency", fontsize = 12, fontweight = "bold", labelpad = 10)
axes[0].tick_params(axis="x", rotation=0, labelsize=10)
axes[0].yaxis.set_minor_locator(plt.MultipleLocator(1000))
axes[0].grid(which="minor", axis="y", linestyle=":", linewidth=0.5)

# Bar plot for Design
bridges_filtered["Design"].value_counts().plot(kind="bar", ax=axes[1], color=colors[4], alpha=0.8)
axes[1].set_xlabel("Design", fontsize = 12, fontweight = "bold", labelpad = -3)
axes[1].set_ylabel("Frequency", fontsize = 12, fontweight = "bold", labelpad = 5)
axes[1].tick_params(axis="x", rotation=30, labelsize=10)
axes[1].yaxis.set_minor_locator(plt.MultipleLocator(1000))
axes[1].grid(which="minor", axis="y", linestyle=":", linewidth=0.5)

plt.text(-11.6, 31000, 
         "Bar plots of categorical variables", 
         size = 14, weight = "bold", color = "black")
plt.text(-11.6, 29200, 
         "Bridge material and design", 
         size = 13, color = "black")
plt.text(-11.6, -6900, 
         "RASHAD MALIK" + " " * 140 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()
[Figure: "Bar plots of categorical variables: bridge material and design"]

I can see some categories in both variables with a very small number of bridges relative to the dominant categories. This can cause issues for the analysis for a number of reasons:

  • Categories with a small number of bridges can introduce noise and variability, making it harder for statistical models to find stable patterns.
  • More categories can make it harder to interpret model results and understand the influence of each group.

To help avoid these issues, I used the following approach to reduce the number of categories:

  • For Material, concrete and steel bridges have much larger counts compared to the rest, so I collapsed the remaining materials into the "Other" category.
  • For Design, beam is by far the most common. Slab is much less common, but still considerably more frequent than the rest. As before, I collapsed the remaining designs into the "Other" category.
In [10]:
# Assigning "Other" to non-matching values in "Material" and "Design"
bridges_filtered = bridges_filtered.copy()
bridges_filtered.loc[~bridges_filtered["Material"].isin(["Concrete", "Steel"]), "Material"] = "Other"
bridges_filtered.loc[~bridges_filtered["Design"].isin(["Beam", "Slab"]), "Design"] = "Other"

# Removing unused categories in both "Material" and "Design"
bridges_filtered["Material"] = bridges_filtered["Material"].cat.remove_unused_categories()
bridges_filtered["Design"] = bridges_filtered["Design"].cat.remove_unused_categories()

1.4 Deriving the current bridge condition¶

In the analysis, I used three variables to describe the condition of the bridges:

  • Deck_rating: The condition of the deck of the bridge.
  • Superstr_rating: The condition of the bridge superstructure.
  • Substr_rating: The condition of the bridge substructure (foundations).

For regression analysis, target variables are typically continuous, however, these ratings are ordinal. I needed to turn these ratings into something continuous so that I could conduct the regression analysis later in this project.

One approach is turning these ratings into "scores". I created three new variables: Deck_score, Superstr_score, and Substr_score. I used the following mapping for these columns:

| Category | Value |
|---|---|
| Failed | 0 |
| Failing | 1 |
| Critical | 2 |
| Serious | 3 |
| Poor | 4 |
| Fair | 5 |
| Satisfactory | 6 |
| Good | 7 |
| Very Good | 8 |
| Excellent | 9 |
In [11]:
# Defining the mapping
rating_to_score = {
    "Failed": 0,
    "Failing": 1,
    "Critical": 2,
    "Serious": 3,
    "Poor": 4,
    "Fair": 5,
    "Satisfactory": 6,
    "Good": 7,
    "Very Good": 8,
    "Excellent": 9
}

# Applying the mapping and creating new "score" columns
bridges_filtered["Deck_score"] = bridges_filtered["Deck_rating"].map(rating_to_score)
bridges_filtered["Superstr_score"] = bridges_filtered["Superstr_rating"].map(rating_to_score)
bridges_filtered["Substr_score"] = bridges_filtered["Substr_rating"].map(rating_to_score)

I also created a fourth column, called Score, which sums the three scores and gives the target variable that I used to assess the condition of all bridges moving forwards.

In [12]:
# Calculating total score
bridges_filtered["Score"] = (bridges_filtered["Deck_score"] +
                                          bridges_filtered["Superstr_score"] +
                                          bridges_filtered["Substr_score"]
                                         )

Before continuing: as noted in the preliminary exploration at the start of the project, some of these variables have missing values. Let's investigate:

In [13]:
# Checking for NaN values in the three scores
print("NaN in Deck_score:", bridges_filtered["Deck_score"].isna().sum())
print("NaN in Superstr_score:", bridges_filtered["Superstr_score"].isna().sum())
print("NaN in Substr_score:", bridges_filtered["Substr_score"].isna().sum())
NaN in Deck_score: 5
NaN in Superstr_score: 1
NaN in Substr_score: 0

There are 5 bridges with missing deck scores, and 1 with a missing superstructure score. If nothing is done about these missing values, those 6 bridges would have misleadingly low total scores.

To account for these missing values, I took the average of the two available scores for each affected bridge and added it to their sum (equivalent to scaling the two-score sum by 3/2). This scales the six bridges' scores appropriately while still maintaining a representative view of their condition.

In [14]:
# Update Score for rows with a missing Deck or Superstr score
bridges_filtered.loc[bridges_filtered["Deck_score"].isna(), "Score"] = (((bridges_filtered["Superstr_score"] +
                                                                               bridges_filtered["Substr_score"]) / 2) +
                                                                              bridges_filtered["Superstr_score"] + 
                                                                              bridges_filtered["Substr_score"]
                                                                             )
bridges_filtered.loc[bridges_filtered["Superstr_score"].isna(), "Score"] = (((bridges_filtered["Deck_score"] +
                                                                                    bridges_filtered["Substr_score"]) / 2) +
                                                                                  bridges_filtered["Deck_score"] +
                                                                                  bridges_filtered["Substr_score"]
                                                                                 )
# Filter rows where either Deck_score or Superstr_score is NaN
filtered_na = bridges_filtered[bridges_filtered["Deck_score"].isna() | bridges_filtered["Superstr_score"].isna()]
filtered_na[["Deck_score","Superstr_score","Substr_score","Score"]]
Out[14]:
                 Deck_score  Superstr_score  Substr_score  Score
Structure_id
010920004518118         NaN             7.0           7.0   21.0
031690AA0273001         NaN             8.0           8.0   24.0
121020B37610001         NaN             7.0           7.0   21.0
131580AA0323001         NaN             8.0           7.0   22.5
190190102001006         NaN             7.0           7.0   21.0
211090AA0348002         1.0             NaN           4.0    7.5

I can see that the logic I defined is working for the six bridges, and they now have scores scaled up accordingly. Let's view a histogram of the new Score column:

In [15]:
# Plot histogram of Score
plt.figure(figsize=(7, 4))
plt.hist(bridges_filtered["Score"], bins=27, alpha=0.8)
plt.xlabel("Score", fontsize = 12, fontweight = "bold", labelpad = 10)
plt.ylabel("Frequency", fontsize = 12, fontweight = "bold", labelpad = 10)
plt.gca().yaxis.set_minor_locator(plt.MultipleLocator(500))
plt.grid(which="minor", axis="y", linestyle=":", linewidth=0.7)
plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(1))
plt.grid(which="minor", axis="x", linestyle=":", linewidth=0.7)

plt.text(-5.3, 9820, 
         "Histogram of Score (bins=27)", 
         size = 14, weight = "bold", color = "black")
plt.text(-5.3, 9120, 
         "Distribution of bridge condition scores", 
         size = 13, color = "black")
plt.text(-5.3, -2600, 
         "RASHAD MALIK" + " " * 42 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()
[Figure: "Histogram of Score (bins=27): distribution of bridge condition scores"]

The distribution is highly concentrated around higher values, with most scores ranging between 15 and 25.

  • The highest frequency appears around scores of 19 to 21, showing that many bridges are in relatively decent condition.
  • Only a small number of bridges have low scores, with very few below a score of 10, which indicates that there are few bridges in poor condition.
  • Given that the maximum possible score is 27, the distribution shows that the majority of bridges are in satisfactory to good condition overall.
  • The right tail tapers off as scores approach the maximum, meaning that there are fewer bridges in near-perfect condition.

Part 2: Exploratory analysis¶

In order to conduct a successful regression analysis, I needed to explore the relationships between the predictors and the target variable.

What I needed to watch out for is collinearity (when predictor variables are highly correlated), as this would impact the quality of the analysis. I also needed to explore continuous and categorical variables separately, as each requires a different approach for detecting signs of collinearity.

2.1 Continuous variables¶

The continuous variables I used in the analysis are the following:

  • Age: The age of the bridge, in years.
  • AverageDaily: The average daily recorded bridge traffic, in number of vehicles.
  • Trucks_percent: The proportion of traffic made up of trucks (lorries), as a percentage.

I began by plotting a scatter matrix, to visualise the distribution of these variables alongside the condition score.

In [16]:
# Creating a scatter matrix
scatter_matrix = pd.plotting.scatter_matrix(
    bridges_filtered[["Age", "AverageDaily", "Trucks_percent", "Score"]],
    figsize=(12, 12),
    marker="o",
    alpha=0.3,
    diagonal="hist",
)

for ax in scatter_matrix.ravel():
    ax.grid(False)
    ax.set_xlabel(ax.get_xlabel(), fontsize=12, rotation=0, labelpad = 8, fontweight = "bold")
    ax.set_ylabel(ax.get_ylabel(), fontsize=12, rotation=90, labelpad = 8, fontweight = "bold")
    ax.xaxis.label.set_size(14)
    ax.yaxis.label.set_size(14)
    ax.tick_params(axis='x', rotation=0)

plt.text(-95, 78000, 
         "Scatter matrix of continuous variables", 
         size = 14, weight = "bold", color = "black")
plt.text(-95, 76000, 
         "Comparing Score, Age, AverageDaily, and Trucks_percent", 
         size = 13, color = "black")
plt.text(-95, -6300, 
         "RASHAD MALIK" + " " * 128 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()
[Figure: "Scatter matrix of continuous variables: comparing Score, Age, AverageDaily, and Trucks_percent"]

Looking at the scatter matrix, it is difficult to discern clear signs of collinearity between the variables, or any clear patterns against the score. Therefore, I turned to a mathematical approach.

I computed the correlation coefficients (also known as the $r$ values) between each variable, to help understand their relationship:

In [17]:
# Creating the heatmap with correlation coefficients
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
sns_heatmap = sns.heatmap(
    bridges_filtered[["Age", "AverageDaily", "Trucks_percent", "Score"]].corr(numeric_only=True),
    vmin=-1, vmax=1, 
    cmap="coolwarm",
    annot=True, 
    fmt=".4f",
    annot_kws={"size": 12, "color": "black"},
    linewidths=0.1,
    linecolor='grey', 
    ax=ax
)

plt.yticks(rotation=0, fontsize=9, fontweight = "bold")
plt.xticks(rotation=0, fontsize=9, fontweight = "bold")
colorbar = sns_heatmap.collections[0].colorbar
colorbar.ax.tick_params(labelsize=9, rotation=0)

plt.text(-1.1, -0.5, 
         "Heatmap with correlation coefficients", 
         size = 14, weight = "bold", color = "black")
plt.text(-1.1, -0.22, 
         "Comparing continuous variables", 
         size = 13, color = "black")
plt.text(-1.1, 4.8, 
         "RASHAD MALIK" + " " * 30 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()
[Figure: "Heatmap with correlation coefficients: comparing continuous variables"]

The heatmap shows the correlation coefficients between the variables Age, AverageDaily, Trucks_percent, and Score. These correlations help understand the relationships between the variables and identify any signs of collinearity.

  • There is a moderately strong negative correlation between Age and Score (correlation coefficient of -0.58). This tells us that as the age of a bridge increases, the condition of the bridge tends to decrease, indicating older bridges generally have lower condition scores.
  • The correlation between Age and Trucks_percent is 0.23, which is a weak positive relationship. This tells us that older bridges tend to have a slightly higher proportion of truck traffic, but the relationship is not strong.
  • AverageDaily has very low correlation coefficients with the other variables. This means that the average daily traffic across the bridges seems to have little or nothing to do with the bridge age, truck percentage, or condition score.
  • The correlation between Trucks_percent and Score is -0.042, which is very close to zero. This suggests there is essentially no linear relationship between the percentage of trucks using a bridge and the bridge's overall condition.

To summarise these observations, the only notable relationship is between Age and Score, where older bridges tend to have lower condition scores. However, there are no significant correlations among the predictor variables themselves, so collinearity is not a concern in this dataset. This suggests that all the continuous predictors can be included in the regression model without the risk of collinearity affecting model stability or quality.
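The heatmap only inspects pairwise correlations. As a supplementary check (not part of the original workflow), the same conclusion can be verified with variance inflation factors (VIFs), which also catch collinearity involving combinations of predictors. The sketch below computes VIFs from scratch with numpy on synthetic data; values near 1 mean a predictor is not explained by the others, while values above roughly 5–10 are a common warning sign:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (shape: n_samples, n_features).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns (with an intercept).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus all predictors except column j
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = ((y - A @ coef) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        factors.append(ss_tot / ss_res)  # equals 1 / (1 - R^2_j)
    return factors

# Synthetic demo: three independent predictors -> VIFs close to 1
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
print([round(v, 2) for v in vif(X)])
```

On the bridge data, the function would be applied to `bridges_filtered[["Age", "AverageDaily", "Trucks_percent"]].to_numpy()`.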

Since Age has the only notable relationship with the score, I visualised this variable differently from the scatter matrix shown earlier: I calculated and plotted the average score by age.

In [18]:
# Calculate average Total Score for each Age
age_score_avg = bridges_filtered.groupby("Age")["Score"].mean()

# Create lineplot
plt.figure(figsize=(7, 4))
age_score_avg.plot(kind="line")
plt.xlabel("Age", fontsize = 12, fontweight = "bold", labelpad = 10)
plt.ylabel("Average score", fontsize = 12, fontweight = "bold", labelpad = 10)
plt.gca().yaxis.set_minor_locator(plt.MultipleLocator(0.2))
plt.grid(which="minor", axis="y", linestyle=":", linewidth=0.7)
plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(2))
plt.grid(which="minor", axis="x", linestyle=":", linewidth=0.7)

plt.text(-7, 25, 
         "Line-plot", 
         size = 14, weight = "bold", color = "black")
plt.text(-7, 24.5, 
         "Average bridge score by age", 
         size = 13, color = "black")
plt.text(-7, 16.35, 
         "RASHAD MALIK" + " " * 40 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()
[Figure: "Average bridge score by age" (line plot)]

This provides a clearer visual confirmation of the negative relationship between the bridge age and the bridge condition score.

2.2 Categorical variables¶

The categorical variables I used in the analysis are the following:

| Variable | Description | Categories |
|---|---|---|
| Material | The dominant material the bridge is made from. | Concrete; Steel; Other (timber, masonry, or other materials) |
| Design | The design of the bridge. | Beam; Slab; Other (arch, frame, truss, movable, suspension, or other bridge designs) |

I began by creating violin plots, to visualise the distribution of these variables alongside the condition score.

In [19]:
# Set up the figure and subplots
fig, axes = plt.subplots(1, 2, figsize=(13, 5), sharey=False)

material_order = ["Concrete", "Steel", "Other"]
design_order = ["Beam", "Slab", "Other"]

# Violin plot for Total Score by Material
sns.violinplot(data=bridges_filtered, x="Material", y="Score", ax=axes[0], alpha=0.8, cut=0, order=material_order, color=colors[4])
axes[0].set_xlabel("Material", fontsize = 12, fontweight = "bold", labelpad = 10)
axes[0].set_ylabel("Score", fontsize = 12, fontweight = "bold", labelpad = 10)
axes[0].tick_params(axis="x", rotation=0, labelsize=10)
axes[0].yaxis.set_minor_locator(plt.MultipleLocator(1))
axes[0].grid(which="minor", axis="y", linestyle=":", linewidth=0.5)
axes[0].set_ylim(bottom=0)

# Violin plot for Total Score by Design
sns.violinplot(data=bridges_filtered, x="Design", y="Score", ax=axes[1], alpha=0.8, cut=0, order=design_order, color=colors[2])
axes[1].set_xlabel("Design", fontsize = 12, fontweight = "bold", labelpad = 10)
axes[1].set_ylabel("Score", fontsize = 12, fontweight = "bold", labelpad = 10)
axes[1].tick_params(axis="x", rotation=0, labelsize=10)
axes[1].yaxis.set_minor_locator(plt.MultipleLocator(1))
axes[1].grid(which="minor", axis="y", linestyle=":", linewidth=0.5)
axes[1].set_ylim(bottom=0)

plt.text(-4.46, 31.3, 
         "Violin plots of material and design", 
         size = 14, weight = "bold", color = "black")
plt.text(-4.46, 29.5, 
         "Visualising the distribution of categorical variables against the condition score", 
         size = 13, color = "black")
plt.text(-4.46, -6, 
         "RASHAD MALIK" + " " * 140 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()
[Figure: violin plots of condition score by material and by design]

From the violin plots, almost all the categories exhibit a wide range of condition scores, indicating considerable variability in bridge condition within each category. The "Other" design category, however, has a more concentrated distribution, indicating more consistent condition scores.

The violin plots compare the categories against the total score, but not against each other. To test for collinearity, I used the chi-square test of independence. The chi-square test checks whether there is a significant association between two categorical variables, which can indicate potential collinearity.

  • Null hypothesis $(H_0)$: There is no association between Material and Design. This means the two variables are independent of each other.
  • Alternative hypothesis $(H_1)$: There is an association between Material and Design, indicating that the two variables are not independent.

The significance level ($\alpha$) is typically set at 0.05. If the p-value is less than 0.05, I can reject the null hypothesis and conclude that there is a statistically significant association between Material and Design.

In [20]:
# Contingency table of the two categorical variables
contingency_table = pd.crosstab(bridges_filtered["Material"], bridges_filtered["Design"])

# Chi-square test calculation
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square statistic: {chi2_stat:.4g}")
print(f"p-value: {p_value:.3g}")
Chi-square statistic: 898.1
p-value: 4.29e-193

Looking at the results, the following can be deduced:

  • A high chi-square statistic (like 898.10) shows a large discrepancy between observed and expected frequencies, suggesting that Material and Design may be associated.
  • The p-value is extremely small (essentially zero), far below the significance level of 0.05. This very low p-value indicates strong evidence against the null hypothesis, suggesting a statistically significant association between Material and Design. In other words, it's very unlikely that the observed association between Material and Design occurred by chance.

Since the p-value is far below the significance level of 0.05, the null hypothesis is rejected. This result suggests a statistically significant association between Material and Design, indicating that these two variables are not independent.

The significant association implies potential collinearity between Material and Design. This means that including both variables in the regression model could lead to issues, as one variable may partially predict the other.

I still included both variables in the regression analysis; however, for future studies I would recommend either using only one of the two, or applying dimensionality reduction techniques, to help mitigate these potential collinearity issues.
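The chi-square statistic itself grows with sample size, so it says little about how strong the association is. One way to quantify strength on a fixed scale is Cramér's V, which rescales the statistic to the range [0, 1]. A minimal sketch, using a made-up contingency table rather than the real bridge counts:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V: association strength between two categorical variables (0 = none, 1 = perfect)."""
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Hypothetical Material x Design counts (illustrative only, not the real data)
toy = pd.DataFrame([[120, 30, 10], [40, 80, 20], [15, 10, 25]],
                   index=["Concrete", "Steel", "Other"],
                   columns=["Beam", "Slab", "Other"])
print(f"Cramér's V: {cramers_v(toy):.3f}")
```

Applied to the real contingency table, a V well above zero would confirm that the association is not only statistically significant but also substantively strong.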

2.3 Preliminary conclusions¶

From this initial exploration of the predictor and target variables, a few early conclusions can be drawn:

  • There is a notable negative relationship between the age and condition score of the bridges. Older bridges tend to be in worse conditions.
  • There are no significant correlations among the age, daily average traffic, and truck percentage variables.
  • There is a statistically significant association between bridge material and bridge design, which may impact the reliability of the upcoming regression analysis.

Part 3: Regression modelling¶

3.1 Preparing variables for regression¶

Before starting the regression calculations, it is important to check the distributions of the continuous variables.

In [21]:
# Create a figure with 2x2 subplots
fig, axes = plt.subplots(2, 2, figsize=(12.5, 8))

# Plot histogram for "Score"
axes[0, 0].hist(bridges_filtered["Score"], bins=20, color=colors[0])
axes[0, 0].set_xlabel("Score", fontsize = 12, fontweight = "bold", labelpad = 5)
axes[0, 0].set_ylabel("Frequency", fontsize = 12, fontweight = "bold", labelpad = 10)

# Plot histogram for "Age"
axes[0, 1].hist(bridges_filtered["Age"], bins=20, color=colors[1])
axes[0, 1].set_xlabel("Age", fontsize = 12, fontweight = "bold", labelpad = 5)
axes[0, 1].set_ylabel("Frequency", fontsize = 12, fontweight = "bold", labelpad = 10)

# Plot histogram for "AverageDaily"
axes[1, 0].hist(bridges_filtered["AverageDaily"], bins=20, color=colors[2])
axes[1, 0].set_xlabel("Average daily traffic", fontsize = 12, fontweight = "bold", labelpad = 10)
axes[1, 0].set_ylabel("Frequency", fontsize = 12, fontweight = "bold", labelpad = 10)

# Plot histogram for "Trucks_percent"
axes[1, 1].hist(bridges_filtered["Trucks_percent"], bins=20, color=colors[3])
axes[1, 1].set_xlabel("Truck percentage", fontsize = 12, fontweight = "bold", labelpad = 10)
axes[1, 1].set_ylabel("Frequency", fontsize = 12, fontweight = "bold", labelpad = 10)

plt.text(-155, 35000, 
         "Histograms (bins=20)", 
         size = 14, weight = "bold", color = "black")
plt.text(-155, 33700, 
         "Visualising the distribution of continuous variables", 
         size = 13, color = "black")
plt.text(-155, -5000, 
         "RASHAD MALIK" + " " * 140 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()
[Figure: histograms of Score, Age, AverageDaily, and Trucks_percent]

When predictors or the target variable are highly skewed, it can lead to problems with interpretability, model reliability, and predictive performance.

From the above histograms, the predictors AverageDaily and Trucks_percent are heavily skewed to the right.

To address the issues that may arise from skewed variables, it's often helpful to transform them (e.g. using logarithmic transformations) before fitting the regression model. This can help improve the model's performance.
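The effect of such a transformation can be illustrated on synthetic data; the lognormal parameters below are arbitrary, chosen only to mimic a heavily right-skewed traffic count:

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=8, sigma=1.5, size=1000)  # synthetic right-skewed "traffic"

def skewness(x):
    """Sample skewness: 0 for a symmetric distribution, > 0 for a long right tail."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

print(f"raw skewness:   {skewness(skewed):.2f}")
print(f"log1p skewness: {skewness(np.log1p(skewed)):.2f}")
```

`np.log1p` computes log(x + 1), the same transformation applied to the model's predictors below; it pulls the long right tail back towards a roughly symmetric distribution.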

I ran two regression models and compared the findings:

  • Model 1: Regression without transforming the skewed variables.
  • Model 2: Regression using a logarithmic transformation on AverageDaily and Trucks_percent.

Next, I prepared the categorical variables for regression by creating "dummy variables" for the Material and Design variables.

  • I dropped the most dominant type of material (concrete) to use it as the reference material that will be compared against.
  • Similarly, I dropped the most dominant bridge design (beam) to use it as the reference design that will be compared against.
In [22]:
# Creating dummy variables for Material and Design
material_d = pd.get_dummies(bridges_filtered.Material, drop_first=True)
design_d = pd.get_dummies(bridges_filtered.Design, drop_first=True)
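Note that drop_first=True drops the alphabetically first level, not the most frequent one; here the two happen to coincide, since "Beam" and "Concrete" sort first and are also the most common categories. A toy illustration:

```python
import pandas as pd

# Hypothetical material labels (not the real data)
s = pd.Series(["Concrete", "Steel", "Other", "Concrete", "Steel"])
dummies = pd.get_dummies(s, drop_first=True)
print(list(dummies.columns))  # "Concrete" (alphabetically first) becomes the implicit reference
```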

3.2 Calculating regression coefficients¶

Now I calculated the regression coefficients (the $\beta$ values), which help estimate the effect the predictor variables have on the target variable.

Linear regression formula: $$ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 $$

In the above equation, $y$ represents the target variable Score. The $X$ values represent the seven predictors:

  • The three continuous variables Age, AverageDaily, and Trucks_percent.
  • The four dummy variables created from the categorical variables (one for each non-reference category): Material-Steel, Material-Other, Design-Slab, and Design-Other.

The $\beta$ values represent the regression coefficients, which indicate how strongly each variable influences the target variable.

First, I set up the variables for the regression analysis, for both Model 1 and Model 2, and ran the calculations:

In [23]:
# y vector: Vector of target values (Score) used in both model 1 and 2
y = bridges_filtered.Score

# Design Matrix X: Matrix of predictor values for model 1
X = np.column_stack((bridges_filtered.Age, bridges_filtered.AverageDaily, bridges_filtered.Trucks_percent,
                      material_d.Steel, material_d.Other,
                      design_d.Slab, design_d.Other))

# Design Matrix X_log: Matrix of predictor values with logarithmic transformations of skewed variables for model 2
X_log = X.copy()
X_log[:, 1] = np.log(X_log[:, 1] + 1)  # Transforming "AverageDaily"
X_log[:, 2] = np.log(X_log[:, 2] + 1)  # Transforming "Trucks_percent"

# Running the regression for model 1 and model 2
reg_model_1 = LinearRegression().fit(X, y)
reg_model_2 = LinearRegression().fit(X_log, y)

Next, I compared key metrics from both models, to determine which model should be used going forwards.

I looked at:

  • The $R^2$ values: Also known as the coefficient of determination, used to assess the goodness of fit of the regression models. It indicates how well the model explains the variability in the target variable. Therefore, it is a useful metric for comparing the two models.
  • Root Mean Squared Error (RMSE): The RMSE allows evaluation and comparison of the accuracy of the regression models. It measures the average magnitude of the errors between predicted values and actual values, giving an indication of how well the models fit the data.
  • Coefficients ($\beta$ values): I considered how the logarithmic transformations affect the interpretation of the coefficients for AverageDaily and Trucks_percent.
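As a reminder of what RMSE measures, it can be computed directly from its definition, $\sqrt{\frac{1}{n}\sum_i(y_i - \hat{y}_i)^2}$, and checked against scikit-learn; the numbers below are toy values, not the bridge data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([20.0, 18.5, 22.0, 19.0])
y_pred = np.array([19.0, 19.5, 21.0, 20.5])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse_manual:.4f}")
```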
In [24]:
# Unpacking the fitted regression coefficients
(beta_Age1, beta_AverageDaily1, beta_Trucks_percent1,
 beta_material_steel1, beta_material_other1,
 beta_design_slab1, beta_design_other1) = reg_model_1.coef_

(beta_Age2, beta_AverageDaily2, beta_Trucks_percent2,
 beta_material_steel2, beta_material_other2,
 beta_design_slab2, beta_design_other2) = reg_model_2.coef_

# Predictions and RMSE calculations for model 1 and model 2
y_pred_model_1 = reg_model_1.predict(X)
y_pred_model_2 = reg_model_2.predict(X_log)
rmse_model_1 = root_mean_squared_error(y, y_pred_model_1)
rmse_model_2 = root_mean_squared_error(y, y_pred_model_2)

# Printing relevant metrics for model comparison
print("Metrics for Model 1:")
print('Model 1: The R2 coefficient of determination is %4.3f' % reg_model_1.score(X, y))
print("Model 1: RMSE is %4.4f" % rmse_model_1)
print('Model 1: Estimated coefficient for average daily use is %4.11f' % (beta_AverageDaily1), 
      "change of total score per car.")
print('Model 1: Estimated coefficient for truck percentage is %4.5f' % beta_Trucks_percent1, 
      'change of total score per percent.')

print("\nMetrics for Model 2:")
print('Model 2: The R2 coefficient of determination is %4.3f' % reg_model_2.score(X_log, y))
print("Model 2: RMSE is %4.4f" % rmse_model_2)
print('Model 2: Estimated coefficient for average daily use is %4.4f' % beta_AverageDaily2, 
      'log units.')
print('Model 2: Estimated coefficient for truck percentage is %4.4f' % beta_Trucks_percent2, 
      'log units.')
Metrics for Model 1:
Model 1: The R2 coefficient of determination is 0.459
Model 1: RMSE is 1.4186
Model 1: Estimated coefficient for average daily use is 0.00000000469 change of total score per car.
Model 1: Estimated coefficient for truck percentage is 0.00568 change of total score per percent.

Metrics for Model 2:
Model 2: The R2 coefficient of determination is 0.461
Model 2: RMSE is 1.4160
Model 2: Estimated coefficient for average daily use is 0.0133 log units.
Model 2: Estimated coefficient for truck percentage is 0.0765 log units.

Based on the metrics provided for Model 1 and Model 2, here are some key points to consider when deciding between the two models:

Comparison of $R^2$ and RMSE values

  • Model 1: $R^2 = 0.459$, RMSE = 1.4186
  • Model 2: $R^2 = 0.461$, RMSE = 1.4160
  • The $R^2$ values are very similar, with Model 2 showing only a slight improvement: it explains marginally more of the variance in Score, but the difference is too small to be meaningful on its own and does not provide a strong reason to favour one model over the other.
  • The RMSE values are also very close: 1.4186 for Model 1 versus 1.4160 for Model 2. Model 2's slightly lower RMSE indicates a marginally better fit; its log transformations may be capturing non-linear relationships that the untransformed predictors in Model 1 miss.

Interpretability of coefficients

  • The estimated coefficients in Model 1 are very small, while the transformed values in Model 2 yield coefficients that are easier to interpret and suggest a more meaningful relationship between Score and the traffic variables (AverageDaily and Trucks_percent). This supports Model 2 as the preferred model, as the transformations seem to better capture the relationship between these predictors and the target variable.

Model 2 is the better choice because it slightly improves the $R^2$ value, has a marginally better RMSE value, and makes the coefficients for AverageDaily and Trucks_percent more interpretable. The log transformation helps capture meaningful effects, making Model 2 more suitable for both practical interpretation and predictive purposes, therefore Model 2 will be used moving forward.
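A practical way to read Model 2's log coefficients: for predictor values well above 1 (where $\log(x+1) \approx \log(x)$), doubling the predictor changes the predicted score by roughly $\beta \ln 2$. A quick check using the coefficients printed above:

```python
import numpy as np

# Model 2 coefficients, copied from the output above
beta_avg_daily = 0.0133
beta_trucks = 0.0765

# Approximate change in predicted Score when a log-transformed predictor doubles
print(f"doubling AverageDaily:   {beta_avg_daily * np.log(2):+.4f} points")
print(f"doubling Trucks_percent: {beta_trucks * np.log(2):+.4f} points")
```

Both effects are small in absolute terms, consistent with the later finding that the traffic variables matter far less than age.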

The model 2 regression can be represented by the following formula:

$$ y = 22.8 + \beta_1\,\text{Age} + \beta_2\log(\text{AverageDaily} + 1) + \beta_3\log(\text{TrucksPercent} + 1) + \beta_4\,\text{Material[Steel]} + \beta_5\,\text{Material[Other]} + \beta_6\,\text{Design[Slab]} + \beta_7\,\text{Design[Other]} $$

Where:

  • $R^2$ value: $0.461$
  • Intercept: $22.8$
  • Coefficients:
    • $\beta_1 = -0.0593$
    • $\beta_2 = 0.0133$
    • $\beta_3 = 0.0765$
    • $\beta_4 = -1.39$
    • $\beta_5 = -2.78$
    • $\beta_6 = 0.0532$
    • $\beta_7 = 0.103$

3.3 Comparing regression coefficients¶

Below I compared the coefficients. For the continuous variables, I calculated the effect of each coefficient across its 10th-to-90th percentile range, which allows the continuous variables to be compared directly without their different scales affecting interpretation.

In [25]:
# Calculating ranges with log transformation where applicable
age_range = bridges_filtered.Age.quantile(0.9) - bridges_filtered.Age.quantile(0.1)
use_range = np.log(bridges_filtered.AverageDaily.quantile(0.9) + 1) - np.log(bridges_filtered.AverageDaily.quantile(0.1) + 1)
trucks_range = np.log(bridges_filtered.Trucks_percent.quantile(0.9) + 1) - np.log(bridges_filtered.Trucks_percent.quantile(0.1) + 1)
score_range = bridges_filtered.Score.quantile(0.9) - bridges_filtered.Score.quantile(0.1)

print("Change to the total score caused by continuous variables over 10th to 90th quantiles:")
print ('Age: %4.1f percent'
       % (100 * (beta_Age2 * age_range) / score_range))

print ('Average daily usage: %4.2f percent'
       % (100 * (beta_AverageDaily2 * use_range) / score_range))

print ('Truck percentage: %4.2f percent'
       % (100 * (beta_Trucks_percent2 * trucks_range) / score_range))

print("\nChange to the total score caused by categorical variables:")
print('Steel bridges: %4.2f' % beta_material_steel2, 
      'change of total score compared to bridges made of the reference material (concrete).')
print('Bridges made of "Other" materials: %4.2f' % beta_material_other2, 
      'change of total score compared to the reference material (concrete).')
print('Slab designed bridges: %4.4f' % beta_design_slab2, 
      'change of total score compared to the reference design (beam).')
print('Bridges of "Other" designs: %4.3f' % beta_design_other2, 
      'change of total score compared to the reference design (beam).')
Change to the total score caused by continuous variables over 10th to 90th quantiles:
Age: -61.7 percent
Average daily usage: 1.66 percent
Truck percentage: 4.80 percent

Change to the total score caused by categorical variables:
Steel bridges: -1.39 change of total score compared to bridges made of the reference material (concrete).
Bridges made of "Other" materials: -2.78 change of total score compared to the reference material (concrete).
Slab designed bridges: 0.0532 change of total score compared to the reference design (beam).
Bridges of "Other" designs: 0.103 change of total score compared to the reference design (beam).

Among the continuous variables, Age showed the largest impact: moving across its 10th-to-90th percentile range changes the predicted Score by -61.7% of the Score's own percentile range, indicating that older bridges generally have lower condition scores. Average daily usage and truck percentage had much smaller effects (+1.66% and +4.80% respectively), indicating that the traffic variables have a relatively minor impact compared to age.

For categorical variables, steel bridges showed a decrease in Score by 1.39 points, and bridges made of "Other" materials showed a decrease by 2.78 points, both compared to concrete bridges (the reference material). Design differences had minimal effects, with slab designs contributing a 0.0532 point increase and Other designs a 0.103 point increase compared to beam designs.

The regression analysis highlights age as the most significant factor affecting bridge condition, with smaller but notable impacts from material types, and minor impacts from the traffic variables.

3.4 Distribution of residuals¶

It is important to analyse residuals, as it helps assess how well the model fits the data and whether the model assumptions are valid.

In [26]:
# Calculating y_hat
y_hat = reg_model_2.predict(X_log)

# Plotting residuals histogram
fig, a1 = plt.subplots(1, 1)
residuals = y_hat - y
a1.hist(residuals, bins=50, density=True)
_ = a1.set_xlabel('Error in prediction (predicted - actual total score)', fontweight = "bold", labelpad = 10)
plt.ylabel("Density", fontsize = 12, fontweight = "bold", labelpad = 10)

plt.text(-12.6, 0.38, 
         "Distribution of residuals (bins=50)", 
         size = 14, weight = "bold", color = "black")
plt.text(-12.6, 0.355, 
         "For regression model 2", 
         size = 13, color = "black")
plt.text(-12.6, -0.075, 
         "RASHAD MALIK" + " " * 32 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

plt.show()

# Calculating RMSE value
rmse = root_mean_squared_error(y, y_hat)
print("Root of the mean squared error: %.2f" % rmse)
[Figure: histogram of residuals for regression model 2]
Root of the mean squared error: 1.42

From looking at the distribution of residuals, the following can be noted:

  • The residuals are approximately centred around zero. Typically, in a well-fitted regression model, residuals should ideally have a mean close to zero, indicating that the model doesn't systematically over-predict or under-predict.

  • The histogram shows a distribution that is roughly bell-shaped, with the highest concentration of residuals clustered near zero and fewer residuals as we move away from zero. This suggests that the errors in the predictions are relatively small for most of the data points.

However, there are a few extreme positive residuals (above 10); since the residuals here are defined as predicted minus actual, this indicates that the model sometimes substantially over-predicts the Score.

  • Given the typical range of Score, a root of the mean squared error (RMSE) of 1.42 seems reasonably low, implying that the model performs fairly well.
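Beyond eyeballing the histogram, symmetry and tail weight can be checked numerically with scipy.stats; the residuals below are synthetic stand-ins (normal noise with the model's RMSE as standard deviation), not the actual residuals:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
residuals = rng.normal(0, 1.42, 5000)  # synthetic stand-in for the model's residuals

print(f"mean:            {residuals.mean():+.3f}")    # ~0 for an unbiased model
print(f"skewness:        {skew(residuals):+.3f}")     # ~0 if symmetric
print(f"excess kurtosis: {kurtosis(residuals):+.3f}")  # ~0 for normal-weight tails
```

Run on the real residuals, a clearly positive skewness would confirm the long right tail of over-predictions visible in the histogram.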

3.5 Plotting predicted scores against actual scores¶

The scatter plot of predicted vs. actual values is a valuable diagnostic tool. It helps assess model fit, and detect patterns indicating model errors.

In [27]:
# Plotting outputs
fig, a = plt.subplots(1,1,figsize=(10,6))
a.scatter(y_hat, y,  color=colors[0], alpha=0.6)
a.plot(y_hat, y_hat, color='orange', linewidth=3)

plt.text(14.7, 30.8, 
         "Predicted scores against actual scores", 
         size = 14, weight = "bold", color = "black")
plt.text(14.7, 29, 
         "For regression model 2", 
         size = 13, color = "black")
plt.text(14.7, -6.5, 
         "RASHAD MALIK" + " " * 90 + "Source: Federal Highway Administration", 
         color = "#f0f0f0", 
         backgroundcolor = "#4d4d4d", 
         fontsize=12)

a.set_xlabel('Predicted score', fontsize = 12, fontweight = "bold", labelpad = 10)
a.set_ylabel('Actual score', fontsize = 12, fontweight = "bold", labelpad = 10)
plt.gca().yaxis.set_minor_locator(plt.MultipleLocator(1))
plt.grid(which="minor", axis="y", linestyle=":", linewidth=0.7)
plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(0.2))
plt.grid(which="minor", axis="x", linestyle=":", linewidth=0.7)

plt.show()
[Figure: scatter plot of predicted scores against actual scores]

Looking at the above plot, we can deduce the following:

  • The model performs reasonably well for mid-range scores (around 17 to 21), but tends to under-predict higher scores and over-predict some lower scores.
  • The presence of outliers (for example, many bridges with actual scores close to 0 receive high predicted scores) shows that the model may benefit from further refinement, possibly through additional transformations, extra predictors, or a higher-order regression (e.g. quadratic terms and above).

Overall, the plot suggests that while our model has reasonable predictive power, it has room for improvement, especially for extreme values.
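One refinement mentioned above, a higher-order regression, amounts to adding powered columns to the design matrix. A sketch on synthetic data with deliberate curvature (not the bridge data); note that the in-sample $R^2$ of the nested quadratic model can never be lower than the linear one's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
age = rng.uniform(0, 100, 500)
score = 23 - 0.06 * age - 0.0008 * age**2 + rng.normal(0, 1.4, 500)  # synthetic curvature

X_lin = age.reshape(-1, 1)
X_quad = np.column_stack((age, age**2))  # add the squared term as an extra column

r2_lin = LinearRegression().fit(X_lin, score).score(X_lin, score)
r2_quad = LinearRegression().fit(X_quad, score).score(X_quad, score)
print(f"linear R2:    {r2_lin:.3f}")
print(f"quadratic R2: {r2_quad:.3f}")
```

Whether the extra term is worth keeping should be judged on held-out data, since in-sample fit always improves as terms are added.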

Summary and conclusion¶

In this project, I successfully conducted an analysis of bridges in Texas, and explored the predictive power of specific variables on bridge conditions.

  • I prepared the dataset by generating the required variables, excluding bridges that could mislead the analysis, and deriving a system to score the condition of the bridges.
  • I explored the relationships between the five predictors and the target variable, and identified a statistically significant association between bridge material and bridge design, which may have an impact on the analysis.
  • I successfully compared two regression models, and opted for Model 2 with the logarithmically transformed variables, as it provided a more practical interpretation of the regression coefficients.
  • I analysed the regression residuals and plotted the predicted scores against the actual scores. This gave a better understanding of how well the model performs, and provided insight into how this analysis could potentially be improved in future bridge studies.

Key findings:

  • Predictive performance:
    • The model performed well for mid-range scores but degraded at the extremes, under-predicting higher scores and over-predicting some lower scores.
    • For future studies, I would recommend refining the model by introducing additional variables and addressing correlated predictors (i.e. Material and Design).
  • Variable influence:
    • Age showed the largest impact, changing the predicted score by -61.7% of the score's percentile range across its own, with smaller but notable impacts from material types, and minor impacts from the traffic variables.

In conclusion, the regression analysis provides valuable insights into the factors influencing bridge conditions, highlighting the significant impact of bridge age, while also underscoring areas for potential model refinement to improve predictive accuracy in future studies.

References¶

  1. Federal Highway Administration, U.S. Department of Transportation. Accessed October 2024. https://highways.dot.gov/
  2. National Bridge Inventory, Federal Highway Administration. Accessed October 2024. https://www.fhwa.dot.gov/
  3. Detailed Code Mapping for Individual Data Items, Federal Highway Administration. Accessed October 2024. https://www.fhwa.dot.gov/