The use of new synthetic data has significantly improved the performance of our Machine Learning (ML) model in predicting solar radiation. In the previous test, the mean distance between the predicted and actual measurements was ~23.70. After incorporating new synthetic data and adjusting the feature set, this mean distance dropped to ~8.25, resulting in a more accurate prediction.
One of the key advantages of synthetic data is its flexibility: it can be tailored to the specific problem at hand. For the updated experiment, the features ZU, YX, width, and length were replaced with Area1, Area2, Area3, Area4, and Area5, which represent the facade areas of the building. Additionally, the angles between the facade surfaces and the project North were introduced as new features (ang1, ang2, ang3, ang4, ang5).
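As a rough illustration of the feature swap described above, the sketch below drops the four replaced features from a record and adds one area and one angle per facade. It is not the pipeline used in the experiment: the function names, the record layout, and the way the angle is derived (the unsigned angle between a facade's 2D normal vector and a North vector) are all assumptions made for this example.

```python
import math

# Features named in the text as replaced in the updated experiment.
REPLACED_FEATURES = ("ZU", "YX", "width", "length")


def angle_to_north_deg(normal, north=(0.0, 1.0)):
    """Unsigned angle in degrees between a 2D facade normal and project North.

    Assumption: the article does not say how the angles were measured;
    this is one plausible definition, used only for illustration.
    """
    dot = normal[0] * north[0] + normal[1] * north[1]
    norm = math.hypot(*normal) * math.hypot(*north)
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def build_updated_record(record, facade_areas, facade_normals):
    """Return a copy of `record` with the replaced features removed and
    Area<i> / ang<i> features added for each facade (hypothetical helper)."""
    updated = {k: v for k, v in record.items() if k not in REPLACED_FEATURES}
    for i, (area, normal) in enumerate(zip(facade_areas, facade_normals), start=1):
        updated[f"Area{i}"] = area
        updated[f"ang{i}"] = angle_to_north_deg(normal)
    return updated


# Toy values, not taken from the real dataset.
old_record = {"ZU": 1.0, "YX": 2.0, "width": 3.0, "length": 4.0, "volume": 100.0}
new_record = build_updated_record(
    old_record,
    facade_areas=[10.0, 20.0],
    facade_normals=[(0.0, 1.0), (1.0, 0.0)],
)
```

Keeping the transformation in one function makes it easy to apply the same feature definition to every synthetic sample, which matters when the model is later compared across datasets.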
A powerful tool for understanding the relationships between variables in the dataset is the correlation matrix, which measures the strength and direction of linear relationships between pairs of variables. The Pearson correlation coefficient quantifies these relationships, where:
- -1 indicates a perfect negative linear correlation,
- 0 indicates no linear correlation,
- 1 indicates a perfect positive linear correlation.
In general, the closer the coefficient is to 1 or -1, the stronger the relationship between the two variables.
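To make the definition concrete, here is a minimal sketch that computes the Pearson coefficient and a small correlation matrix in plain Python. The column names and values are toy data chosen for the example, not values from the solar radiation dataset, and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined linear correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict {name_a: {name_b: r}} for all pairs of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns (illustrative only).
toy_columns = {
    "Area1": [10.0, 20.0, 30.0, 40.0],
    "volume": [100.0, 210.0, 290.0, 405.0],
    "radiation": [50.0, 40.0, 30.0, 20.0],
}
matrix = correlation_matrix(toy_columns)
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle carries information. Reading the row of the target variable gives the coefficient of each feature with respect to the target.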
Correlation matrix comparison
The left matrix shows the correlation relationships from the earlier dataset, while the right matrix represents the updated dataset.
Previous Dataset: The prediction of the total solar radiation target was mainly influenced by three features: YX, ZU, and volume. Other features, such as angle, length, and width, had a relatively limited influence. Additionally, several other variables (seen as white cells in the matrix) did not participate in the prediction.
Updated Dataset: The revised dataset, now incorporating the new features (Area1, Area2, Area3, Area4, Area5), shifted the importance towards these facade areas and volume. In contrast, the angular features had a minor impact, though their correlation coefficient increased slightly from 0.043 in the previous test to 0.056 in the updated version, suggesting a higher, though still small, influence.
Key Insight: The new synthetic data introduced additional useful features and optimized the feature weight distribution, enhancing the model's prediction accuracy.
Results
The images below demonstrate the accuracy improvements between the two tests. The X-axis represents the different independent tests (from A to K in the previous test and from A to M in the updated test), while the Y-axis shows the total solar radiation values. The diamonds represent the real measurements, and the circles represent the predictions made by the ML model (green in the first test, blue in the revised test).
- In the first test (green circles), the mean distance between predicted and actual values was 23.70.
- In the revised test (blue circles), the mean distance fell to 8.25, a reduction of about 65%, indicating a substantial improvement in the model's accuracy.
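The comparison above can be sketched in a few lines. The article does not define "mean distance" precisely; the code below assumes it is the mean absolute difference between predicted and measured values, and the function names are hypothetical. The sample predictions are toy values; only 23.70 and 8.25 come from the text.

```python
def mean_distance(predicted, actual):
    """Mean absolute difference between predictions and measurements.

    Assumption: this is one reasonable reading of "mean distance";
    the original metric may be defined differently.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_reduction(before, after):
    """Fraction by which the error dropped going from `before` to `after`."""
    return (before - after) / before


# Toy example of the metric itself.
toy_error = mean_distance([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])

# Reported mean distances from the two tests.
reduction = relative_reduction(23.70, 8.25)
print(f"{reduction:.1%}")  # prints 65.2%
```

Expressing the change as a relative reduction makes the two tests comparable even though they cover a different number of independent cases (A to K versus A to M).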
Notably, the ML model itself did not change between the two tests; only the training data was updated. This demonstrates how better-quality data can improve the performance of an existing model.
Conclusions
The incorporation of new synthetic data not only added valuable features for predicting solar radiation but also optimized the way these features were weighted within the model. Despite the model remaining unchanged, the prediction accuracy was significantly improved by refining the data used for training. This confirms the critical role that well-constructed synthetic data can play in enhancing the performance of Machine Learning models for environmental simulations.