Building a Predictive Maintenance System with Python

The Business Challenge

Our client, a mid-sized manufacturing plant, was experiencing an average of 12 equipment failures per month, resulting in approximately $250,000 in lost productivity annually. Traditional scheduled maintenance wasn't catching these issues early enough, while condition-based monitoring required expensive sensor installations.

We proposed a predictive maintenance system that would:

Reduce unplanned downtime by at least 30%
Extend equipment lifespan through timely interventions
Optimize maintenance staff scheduling
Provide actionable insights through a dashboard

Data Collection and Preparation

The first challenge was gathering sufficient historical data. We worked with:

Data Sources

data_sources = {
    "SCADA_system": "5 years of operational data (temperature, vibration, pressure)",
    "maintenance_logs": "Excel files with repair history",
    "operator_logs": "Shift notes in PDF format",
    "sensor_data": "Real-time IoT data from critical machines"
}

We used pandas for data cleaning, labeling, and feature engineering:

Python: Data Cleaning

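import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load data
scada = pd.read_csv('scada_data.csv', parse_dates=['timestamp'])
maintenance = pd.read_excel('maintenance_logs.xlsx')
maintenance['failure_date'] = pd.to_datetime(maintenance['failure_date'])

# Create target variable (failure within next 7 days): match each reading
# to the next logged failure of the same machine, if one occurs within 7 days
scada = scada.sort_values('timestamp')
failures = maintenance[['machine_id', 'failure_date']].dropna().sort_values('failure_date')
scada = pd.merge_asof(
    scada, failures,
    left_on='timestamp', right_on='failure_date',
    by='machine_id', direction='forward',
    tolerance=pd.Timedelta(days=7)
)
scada['failure_within_7_days'] = scada['failure_date'].notna().astype(int)

# Feature engineering (time-based rolling windows need a DatetimeIndex)
scada = scada.sort_values(['machine_id', 'timestamp']).reset_index(drop=True)
scada['rolling_temp_avg'] = (
    scada.set_index('timestamp')
    .groupby('machine_id')['temperature']
    .rolling('24h')
    .mean()
    .to_numpy()  # rows are already in (machine_id, timestamp) order
)
# The first reading of each machine has no predecessor, so its change is set to 0
scada['vibration_change'] = scada.groupby('machine_id')['vibration'].diff().fillna(0)

# Normalize features
# (for a strict evaluation, fit the scaler on the training split only)
scaler = StandardScaler()
features = ['temperature', 'vibration', 'pressure', 'rolling_temp_avg', 'vibration_change']
scada[features] = scaler.fit_transform(scada[features])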

Result

After preprocessing, we had a clean dataset with 27 meaningful features across 42 machines, covering 1.2 million data points over 5 years.
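
To make the "failure within next 7 days" target concrete, here is a minimal sketch on toy data. The machine ID, dates, and column names are hypothetical, and pd.merge_asof is one way to build such a label, not necessarily the one used in the project.

```python
import pandas as pd

# Hypothetical toy data: ten daily readings for one machine and one logged failure
readings = pd.DataFrame({
    "machine_id": ["M1"] * 10,
    "timestamp": pd.date_range("2023-01-01", periods=10, freq="D"),
})
failures = pd.DataFrame({
    "machine_id": ["M1"],
    "failure_date": pd.date_range("2023-01-09 12:00", periods=1, freq="D"),
})

# For each reading, look forward to the next failure of the same machine
# and keep the match only if it happens within 7 days
labeled = pd.merge_asof(
    readings.sort_values("timestamp"),
    failures.sort_values("failure_date"),
    left_on="timestamp",
    right_on="failure_date",
    by="machine_id",
    direction="forward",
    tolerance=pd.Timedelta(days=7),
)
labeled["failure_within_7_days"] = labeled["failure_date"].notna().astype(int)
print(labeled[["timestamp", "failure_within_7_days"]])
```

Readings taken more than 7 days before the failure, or after it, get label 0. Everything in between gets label 1.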

Model Development

We experimented with several approaches before settling on a hybrid model:

1. Random Forest Classifier (Baseline)

Python: Random Forest Model

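from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Prepare data
X = scada[features]
y = scada['failure_within_7_days']
# stratify keeps the (rare) failure rate equal in both splits.
# Caveat: a random split of time-series rows can leak neighboring readings
# into the test set; a chronological split gives a more honest estimate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train model (random_state makes the run reproducible)
rf = RandomForestClassifier(
    n_estimators=200, max_depth=10, class_weight='balanced',
    random_state=42, n_jobs=-1
)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))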
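
Because sensor readings are ordered in time, a random split can place near-identical neighboring readings in both the training and test sets. The sketch below shows one alternative, a chronological split. The helper name chronological_split and the toy data are hypothetical and not part of the original project.

```python
import pandas as pd

def chronological_split(df, time_col="timestamp", test_fraction=0.2):
    """Split so that every test row is at or after a cutoff time.

    Expects 0 < test_fraction < 1. Rows sharing the cutoff timestamp
    all go to the test set, so no timestamp appears in both splits.
    """
    ordered = df.sort_values(time_col)
    split_at = int(len(ordered) * (1 - test_fraction))
    cutoff = ordered[time_col].iloc[split_at]
    train = ordered[ordered[time_col] < cutoff]
    test = ordered[ordered[time_col] >= cutoff]
    return train, test

# Hypothetical toy data: two machines, ten daily readings each
demo = pd.DataFrame({
    "machine_id": ["M1", "M2"] * 10,
    "timestamp": pd.date_range("2023-01-01", periods=10, freq="D").repeat(2),
    "temperature": range(20),
})
train, test = chronological_split(demo)
print(len(train), len(test))
```

The resulting frames can replace the output of train_test_split, for example X_train = train[features] and y_train = train['failure_within_7_days'].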

2. LSTM Neural Network (Time Series Approach)

Python: LSTM Model

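import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout

WINDOW = 168  # Last 168 hours (1 week), assuming hourly readings

# Reshape data for LSTM: one sequence per machine, shape (n_machines, WINDOW, n_features)
# Machines with fewer than WINDOW readings are skipped so the sequences stack cleanly
groups = [
    g for _, g in scada.sort_values('timestamp').groupby('machine_id')
    if len(g) >= WINDOW
]
X_lstm = np.stack([g[features].to_numpy()[-WINDOW:] for g in groups])
# Label each sequence with the target at its final timestep
y_lstm = np.array([g['failure_within_7_days'].iloc[-1] for g in groups])

# Build model
lstm_model = Sequential([
    Input(shape=(WINDOW, len(features))),
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lstm_model.fit(X_lstm, y_lstm, epochs=20, batch_size=32, validation_split=0.2)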
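
Taking only the last week per machine yields a single training sequence per machine. A common way to get more sequences is to slide a window over each machine's history. The following is a minimal sketch of that idea, assuming hourly readings. The helper make_windows and its stride parameter are hypothetical and not taken from the project.

```python
import numpy as np

def make_windows(values, labels, window=168, stride=24):
    """Cut one machine's readings into overlapping windows.

    values: array of shape (n_timesteps, n_features), oldest first
    labels: array of shape (n_timesteps,), per-reading target
    Returns X with shape (n_windows, window, n_features) and y with shape
    (n_windows,); each window takes the label of its final timestep.
    """
    X, y = [], []
    for end in range(window, len(values) + 1, stride):
        X.append(values[end - window:end])
        y.append(labels[end - 1])
    if not X:  # history shorter than one window
        return np.empty((0, window, values.shape[1])), np.empty((0,))
    return np.stack(X), np.array(y)

# Hypothetical example: 400 hourly readings with 3 features
demo_values = np.arange(400 * 3, dtype=float).reshape(400, 3)
demo_labels = np.zeros(400, dtype=int)
X_demo, y_demo = make_windows(demo_values, demo_labels)
print(X_demo.shape, y_demo.shape)
```

Calling this once per machine and concatenating the results gives a training set in the (samples, timesteps, features) layout that Keras LSTM layers expect.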

3. Ensemble Approach (Final Solution)

Our production model combined both approaches with a custom weighting system:

Python: Ensemble Prediction

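def predict_failure(machine_data):
    """Combine RF and LSTM predictions with business logic.

    machine_data: DataFrame of one machine's preprocessed readings, oldest first.
    """
    # RF scores the most recent reading; the LSTM scores the trailing window.
    # prepare_lstm_input is a helper not shown here; it is expected to return
    # an array of shape (1, 168, len(features)).
    latest = machine_data.iloc[[-1]]
    rf_prob = rf.predict_proba(latest[features])[0][1]
    lstm_prob = lstm_model.predict(prepare_lstm_input(machine_data))[0][0]

    # Weighted average favoring LSTM for recent trends
    combined_prob = 0.3 * rf_prob + 0.7 * lstm_prob

    # Apply business rules
    if bool(latest['critical_machine'].iloc[0]):
        combined_prob *= 1.2  # More sensitive for critical machines

    return bool(combined_prob > 0.45)  # Optimal threshold from ROC analysis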
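
The article does not say which criterion the ROC analysis used to arrive at 0.45. One common choice is Youden's J statistic (true positive rate minus false positive rate). The sketch below assumes that criterion; the function name and sample data are hypothetical.

```python
import numpy as np

def best_threshold(y_true, y_prob):
    """Return the cutoff that maximizes Youden's J (TPR - FPR), and that J.

    A sample is predicted positive when its probability is >= the cutoff.
    Ties keep the lowest cutoff.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    n_pos = max(int(y_true.sum()), 1)
    n_neg = max(int((~y_true).sum()), 1)
    best_t, best_j = 0.5, -1.0
    for t in np.unique(y_prob):  # candidate cutoffs, ascending
        pred = y_prob >= t
        tpr = (pred & y_true).sum() / n_pos
        fpr = (pred & ~y_true).sum() / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = float(t), float(tpr - fpr)
    return best_t, best_j

# Hypothetical validation labels and combined probabilities
y_val = [0, 0, 0, 1, 0, 1, 1, 1]
p_val = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
threshold, j_stat = best_threshold(y_val, p_val)
print(threshold, j_stat)
```

In practice the cutoff would also reflect the cost of a missed failure versus an unnecessary inspection, which a purely statistical criterion does not capture.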

Model Performance

Precision: 82% (When we predict failure, we're right 82% of the time)
Recall: 78% (We catch 78% of actual failures)
False Positive Rate: 9% (Acceptable for this use case)
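
For readers less familiar with these metrics, here is how they follow from confusion-matrix counts. The counts below are illustrative only. They were chosen to roughly reproduce the reported rates and are not the project's actual confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall and false positive rate from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
    }

# Illustrative counts (hypothetical, not measured data)
metrics = classification_metrics(tp=78, fp=17, fn=22, tn=173)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```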

Deployment Architecture

We implemented the solution using this tech stack:

System Architecture

architecture = {
    "data_ingestion": "Apache Kafka for real-time sensor data",
    "processing": "PySpark for large batch processing",
    "model_serving": "FastAPI microservice with TensorFlow Serving",
    "storage": "PostgreSQL for structured data + S3 for raw data",
    "dashboard": "Plotly Dash with Celery for async updates",
    "monitoring": "Prometheus + Grafana for system health",
    "scheduling": "Airflow for maintenance task coordination"
}

Critical Implementation Details

Several key decisions made the system successful:

Model Retraining: Weekly retraining with new data using Airflow DAGs
Explainability: SHAP values for maintenance team transparency
Fail-safes: Fallback to simpler statistical models if ML service fails
Alerting: Slack integration for urgent predictions
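
The fail-safe idea can be sketched as a thin wrapper around the model call. The article does not describe which statistical model served as the fallback, so a simple z-score rule is assumed here. The function names and the ml_predict callable are hypothetical.

```python
from statistics import mean, stdev

def zscore_alarm(history, latest, z_limit=3.0):
    """Statistical fallback: flag a reading far from its recent mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > z_limit

def predict_with_fallback(ml_predict, history, latest):
    """Try the ML service first; use the z-score rule if it fails.

    ml_predict is a hypothetical callable wrapping the model service that
    raises an exception when the service is unavailable.
    Returns (alarm, source) so callers can see which path produced the result.
    """
    try:
        return ml_predict(history, latest), "ml"
    except Exception:
        return zscore_alarm(history, latest), "fallback"

def unavailable_service(history, latest):
    """Stand-in for a model service that is down."""
    raise ConnectionError("model service unreachable")

recent_vibration = [10, 11, 10, 9, 10]
result_spike = predict_with_fallback(unavailable_service, recent_vibration, 25)
result_normal = predict_with_fallback(unavailable_service, recent_vibration, 10.5)
print(result_spike, result_normal)
```

Returning the source alongside the prediction makes it easy to count fallback activations in monitoring.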

Business Results

After 6 months in production:

Key Metrics Improvement

37% reduction in unplanned downtime (exceeded goal)
22% decrease in maintenance costs (better scheduling)
15% extension in equipment lifespan
ROI: 4.2x (system paid for itself in 5 months)

Lessons Learned

This project taught us several valuable lessons about production ML:

Data Quality > Model Complexity: Cleaning the maintenance logs took 40% of project time but provided the biggest accuracy gains
Human-in-the-loop: Maintenance staff insights improved feature engineering
Explainability Matters: Technicians needed to understand why predictions were made to trust the system
Edge Cases: Had to handle sensor failures gracefully (common in industrial settings)
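
As an illustration of handling sensor failures gracefully, the sketch below flags missing and stuck (flatlined) readings and fills only short gaps. The function name, the thresholds, and the flatline heuristic are assumptions made for this example, not the project's actual rules.

```python
import numpy as np
import pandas as pd

def clean_sensor_series(values, max_gap=3, flatline_len=6):
    """Flag faulty readings and forward-fill at most max_gap of them in a row.

    values: pd.Series of readings at a regular interval, oldest first.
    A run of flatline_len or more identical readings is treated as a stuck sensor.
    Returns a DataFrame with the cleaned 'value' and a boolean 'sensor_fault'.
    """
    s = values.astype(float)
    # Consecutive identical readings share a run id (each NaN is its own run)
    run_id = (s != s.shift()).cumsum()
    run_len = run_id.map(run_id.value_counts())
    fault = s.isna() | (run_len >= flatline_len)
    # Faulty readings beyond max_gap in a row stay NaN for downstream handling
    cleaned = s.mask(fault).ffill(limit=max_gap)
    return pd.DataFrame({"value": cleaned, "sensor_fault": fault})

# Hypothetical readings: one dropped sample, then a sensor stuck at 5.0
raw = pd.Series([1.0, 2.0, np.nan, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0])
result = clean_sensor_series(raw)
print(result)
```

Keeping the sensor_fault flag lets the prediction service skip or down-weight machines whose inputs are currently unreliable instead of scoring filled-in values as if they were real.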

Next Steps

We're currently working on:

Adding computer vision for visual inspection integration
Implementing reinforcement learning for dynamic maintenance scheduling
Porting the system to Rust for performance-critical components

Python Developer

Full-stack Python developer specializing in industrial AI applications, with 5+ years of experience bringing machine learning solutions to manufacturing and logistics.