The Business Challenge
Our client, a mid-sized manufacturing plant, was experiencing an average of 12 equipment failures per month, resulting in approximately $250,000 in lost productivity annually. Traditional scheduled maintenance wasn't catching these issues early enough, while condition-based monitoring required expensive sensor installations.
We proposed a predictive maintenance system that would:
- Reduce unplanned downtime by at least 30%
- Extend equipment lifespan through timely interventions
- Optimize maintenance staff scheduling
- Provide actionable insights through a dashboard
Data Collection and Preparation
The first challenge was gathering sufficient historical data. We worked with:
data_sources = {
    "SCADA_system": "5 years of operational data (temperature, vibration, pressure)",
    "maintenance_logs": "Excel files with repair history",
    "operator_logs": "Shift notes in PDF format",
    "sensor_data": "Real-time IoT data from critical machines",
}
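We used pandas for data cleaning and feature engineering:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load data
scada = pd.read_csv('scada_data.csv', parse_dates=['timestamp'])
maintenance = pd.read_excel('maintenance_logs.xlsx')
maintenance['failure_date'] = pd.to_datetime(maintenance['failure_date'])
# Create target variable (failure within next 7 days): match each reading to
# the next failure of the same machine, if one occurs within 7 days
scada = scada.sort_values('timestamp')
failures = maintenance[['machine_id', 'failure_date']].dropna().sort_values('failure_date')
scada = pd.merge_asof(
    scada, failures, left_on='timestamp', right_on='failure_date',
    by='machine_id', direction='forward', tolerance=pd.Timedelta(days=7),
)
scada['failure_within_7_days'] = scada['failure_date'].notna().astype(int)
# Feature engineering (per machine, in time order)
scada = scada.sort_values(['machine_id', 'timestamp']).reset_index(drop=True)
scada['rolling_temp_avg'] = (
    scada.set_index('timestamp')       # time-based windows need a datetime index
         .groupby('machine_id')['temperature']
         .rolling('24h').mean()
         .to_numpy()                   # row order matches the sort above
)
scada['vibration_change'] = scada.groupby('machine_id')['vibration'].diff().fillna(0)
# Normalize features
scaler = StandardScaler()
features = ['temperature', 'vibration', 'pressure', 'rolling_temp_avg', 'vibration_change']
scada[features] = scaler.fit_transform(scada[features])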
Result
After preprocessing, we had a clean dataset with 27 meaningful features across 42 machines, covering 1.2 million data points over 5 years.
Model Development
We experimented with several approaches before settling on a hybrid model:
1. Random Forest Classifier (Baseline)
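from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Prepare data
X = scada[features]
y = scada['failure_within_7_days']
# stratify keeps the rare failure class at the same rate in both splits.
# Note: a random split on time-ordered rows can leak information between
# train and test; a chronological split is the stricter check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Train model (class_weight='balanced' compensates for rare failures)
rf = RandomForestClassifier(
    n_estimators=200, max_depth=10, class_weight='balanced', random_state=42
)
rf.fit(X_train, y_train)
# Evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))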
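A random split on time-ordered sensor rows can put readings from the same failure episode on both sides of the split, which tends to flatter the test score. The following is a minimal sketch of a chronological split, assuming a `timestamp` column like the one in the SCADA data. The helper name `time_based_split` and the toy values are illustrative and not part of the project code.

```python
import pandas as pd

def time_based_split(df, time_col="timestamp", test_fraction=0.2):
    """Split rows chronologically: oldest rows for training, newest for testing."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n_test = round(len(ordered) * test_fraction)
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy frame standing in for the SCADA data (values are made up)
toy = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=10, freq="D"),
    "temperature": [70, 71, 69, 72, 75, 74, 78, 80, 79, 83],
    "failure_within_7_days": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})
train, test = time_based_split(toy)
print(len(train), len(test))
```

Every training row here precedes every test row, so the model is always evaluated on readings it could not have seen.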
2. LSTM Neural Network (Time Series Approach)
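import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
# Reshape data for LSTM: one sequence per machine, built from its
# last 168 readings (1 week, assuming hourly data)
WINDOW = 168
# Skip machines with fewer than WINDOW rows so np.stack gets equal shapes
groups = [g for _, g in scada.groupby('machine_id') if len(g) >= WINDOW]
X_lstm = np.stack([g[features].to_numpy()[-WINDOW:] for g in groups])
# Label per machine: 1 if any of its readings was flagged. This spans the
# machine's full history, not only the 168-hour window.
y_lstm = np.array([g['failure_within_7_days'].max() for g in groups])
# Build model
lstm_model = Sequential([
    LSTM(64, input_shape=(WINDOW, len(features)), return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dense(1, activation='sigmoid')
])
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lstm_model.fit(X_lstm, y_lstm, epochs=20, batch_size=32, validation_split=0.2)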
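Taking only the last week per machine yields one training sequence per machine. A common alternative is to slice each machine's history into overlapping windows. Below is a minimal sketch of that idea, assuming hourly readings sorted oldest first. The function `make_windows`, its `stride` parameter, and the choice to label each window by its final timestep are assumptions for illustration, not the project's actual preprocessing.

```python
import numpy as np

def make_windows(values, labels, window=168, stride=24):
    """Slice one machine's readings into overlapping sequences.

    values: array of shape (n_timesteps, n_features), oldest first.
    labels: array of shape (n_timesteps,), one label per timestep.
    Each window is labelled with the label at its final timestep.
    """
    values = np.asarray(values)
    labels = np.asarray(labels)
    X, y = [], []
    for end in range(window, len(values) + 1, stride):
        X.append(values[end - window:end])
        y.append(labels[end - 1])
    if not X:  # history shorter than one window
        return np.empty((0, window, values.shape[1])), np.empty((0,), dtype=labels.dtype)
    return np.stack(X), np.array(y)

# Toy example: 10 timesteps, 2 features, window of 4, stride of 2
values = np.arange(20).reshape(10, 2)
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X, y = make_windows(values, labels, window=4, stride=2)
print(X.shape, y)
```

Windows from all machines can then be concatenated along the first axis to form the LSTM input.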
3. Ensemble Approach (Final Solution)
Our production model combined both approaches with a custom weighting system:
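def predict_failure(machine_data):
    """Combine RF and LSTM predictions with business logic.

    machine_data: DataFrame of recent readings for one machine, oldest first.
    """
    # RF scores the most recent reading; the LSTM scores the recent sequence.
    # prepare_lstm_input (not shown) is expected to return shape (1, 168, n_features).
    rf_prob = rf.predict_proba(machine_data[features].iloc[[-1]])[0][1]
    lstm_prob = lstm_model.predict(prepare_lstm_input(machine_data))[0][0]
    # Weighted average favoring LSTM for recent trends
    combined_prob = 0.3 * rf_prob + 0.7 * lstm_prob
    # Apply business rules
    if machine_data['critical_machine'].iloc[-1]:
        # More sensitive for critical machines; cap so it stays a valid probability
        combined_prob = min(combined_prob * 1.2, 1.0)
    return combined_prob > 0.45  # Optimal threshold from ROC analysis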
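The write-up does not say how the 0.45 threshold was derived from the ROC curve. One common choice is the threshold that maximizes Youden's J statistic (true positive rate minus false positive rate). The sketch below shows that approach on made-up scores; the function name `youden_threshold` and the data are illustrative, and the project may have used a different criterion.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing TPR - FPR over the observed scores.

    Assumes both classes are present in y_true.
    A sample is predicted positive when score >= threshold.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_threshold, best_j = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        tpr = (pred & y_true).sum() / y_true.sum()
        fpr = (pred & ~y_true).sum() / (~y_true).sum()
        if tpr - fpr > best_j:
            best_threshold, best_j = float(t), float(tpr - fpr)
    return best_threshold, best_j

# Made-up validation labels and combined probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.90]
threshold, j = youden_threshold(y_true, scores)
print(threshold, round(j, 3))
```

In practice the threshold would be chosen on a held-out validation set, and a cost-weighted criterion may be preferable when missed failures are far more expensive than false alarms.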
Model Performance
- Precision: 82% (When we predict failure, we're right 82% of the time)
- Recall: 78% (We catch 78% of actual failures)
- False Positive Rate: 9% (Acceptable for this use case)
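For reference, the three rates above follow directly from confusion-matrix counts. The sketch below uses hypothetical counts chosen only to make the arithmetic easy to follow; they are not the project's actual confusion matrix.

```python
def classification_rates(tp, fp, fn, tn):
    """Compute precision, recall, and false positive rate from raw counts."""
    precision = tp / (tp + fp)        # of predicted failures, how many were real
    recall = tp / (tp + fn)           # of real failures, how many were caught
    false_positive_rate = fp / (fp + tn)  # of healthy cases, how many raised an alarm
    return precision, recall, false_positive_rate

# Hypothetical counts for illustration only
precision, recall, fpr = classification_rates(tp=40, fp=10, fn=10, tn=90)
print(f"precision={precision:.2f} recall={recall:.2f} fpr={fpr:.2f}")
```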
Deployment Architecture
We implemented the solution using this tech stack:
architecture = {
    "data_ingestion": "Apache Kafka for real-time sensor data",
    "processing": "PySpark for large batch processing",
    "model_serving": "FastAPI microservice with TensorFlow Serving",
    "storage": "PostgreSQL for structured data + S3 for raw data",
    "dashboard": "Plotly Dash with Celery for async updates",
    "monitoring": "Prometheus + Grafana for system health",
    "scheduling": "Airflow for maintenance task coordination",
}
Critical Implementation Details
Several key decisions made the system successful:
- Model Retraining: Weekly retraining with new data using Airflow DAGs
- Explainability: SHAP values for maintenance team transparency
- Fail-safes: Fallback to simpler statistical models if ML service fails
- Alerting: Slack integration for urgent predictions
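The fail-safe idea can be sketched as a wrapper that tries the ML service first and falls back to a simple statistical rule when the call fails. The z-score rule, the function names, and the `broken_service` stand-in below are all assumptions for illustration; the original does not specify which statistical model was used.

```python
from statistics import mean, stdev

def zscore_alarm(history, latest, z_limit=3.0):
    """Flag the latest reading if it lies more than z_limit standard
    deviations from the mean of the historical readings."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > z_limit

def predict_with_fallback(ml_predict, history, latest):
    """Try the ML predictor; if it raises, use the z-score rule instead."""
    try:
        return ml_predict(history, latest), "ml"
    except Exception:
        return zscore_alarm(history, latest), "fallback"

def broken_service(history, latest):
    """Stand-in for an unavailable model service."""
    raise ConnectionError("model service unavailable")

history = [70.0, 71.0, 69.0, 70.5, 70.2, 69.8]  # made-up temperature readings
alarm, source = predict_with_fallback(broken_service, history, 85.0)
quiet, _ = predict_with_fallback(broken_service, history, 70.4)
print(alarm, source, quiet)
```

Returning the source alongside the prediction lets the dashboard show when an alert came from the fallback rule rather than the full model.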
Business Results
After 6 months in production:
Key Metrics Improvement
- 37% reduction in unplanned downtime (exceeded goal)
- 22% decrease in maintenance costs (better scheduling)
- 15% extension in equipment lifespan
- ROI: 4.2x (system paid for itself in 5 months)
Lessons Learned
This project taught us several valuable lessons about production ML:
- Data Quality > Model Complexity: Cleaning the maintenance logs took 40% of project time but provided the biggest accuracy gains
- Human-in-the-loop: Maintenance staff insights improved feature engineering
- Explainability Matters: Technicians needed to understand why predictions were made to trust the system
- Edge Cases: Had to handle sensor failures gracefully (common in industrial settings)
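As an illustration of the sensor-failure point, the sketch below masks implausible readings and long flat runs (a typical sign of a stuck sensor), then forward-fills only short gaps so that long outages remain visible as missing data. The function name, the thresholds, and the sample values are assumptions, not the rules used in the project.

```python
import pandas as pd

def clean_sensor_series(series, valid_min, valid_max, max_flat=3, max_fill=2):
    """Mask out-of-range and stuck readings, then forward-fill short gaps."""
    cleaned = series.astype(float)
    # Out-of-range values are treated as sensor faults
    cleaned = cleaned.where(cleaned.between(valid_min, valid_max))
    # Identify runs of identical consecutive values
    run_id = (cleaned != cleaned.shift()).cumsum()
    run_len = run_id.map(run_id.value_counts())
    # A run longer than max_flat suggests a stuck sensor; mask the whole run
    cleaned = cleaned.mask(run_len > max_flat)
    # Fill at most max_fill consecutive missing readings with the last good value
    return cleaned.ffill(limit=max_fill)

# Made-up temperature trace: one error code (-999) and one stuck run of 75s
raw = pd.Series([70, 71, -999, 72, 75, 75, 75, 75, 75, 74])
result = clean_sensor_series(raw, valid_min=0, valid_max=200, max_flat=3, max_fill=1)
print(result.tolist())
```

Readings that stay missing after this step can be routed to the fallback logic or excluded from prediction instead of being silently imputed.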
Next Steps
We're currently working on:
- Adding computer vision for visual inspection integration
- Implementing reinforcement learning for dynamic maintenance scheduling
- Porting performance-critical components of the system to Rust