Synthetic data is data that is created artificially rather than collected from real-world events. The goal is to provide a fast and cheap alternative to real-world data for building accurate artificial intelligence models. For this study, an incident solar radiation simulation is run automatically over different building design options to generate 2,000 data points for training a supervised model. The objective is to estimate the incident solar radiation of a building without simulation software.
Synthetic data is a good alternative when privacy matters, when the required data does not exist, or when collecting it is too expensive. In this case, it is relevant because the features in the dataset can be adjusted, added, or removed until the incident solar radiation prediction reaches an accuracy of 85% or higher.
Rhinoceros and Grasshopper are used for the parametric model, Ladybug is used to simulate the incident solar radiation, and the Laga library generates the geometry and collects the information to create the data.
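The generation loop described above can be sketched in plain Python. This is only an illustration of the pattern (sample random design parameters, run a simulation, store one row per design), not the Laga or Ladybug API. The function names, the parameter ranges, and the `placeholder_simulation` stand-in are all assumptions made for the example; in the real workflow the simulation step is performed by Ladybug inside Grasshopper.

```python
import csv
import io
import random


def generate_dataset(n_samples, simulate, seed=0):
    """Sample random design parameters and record one row per design.

    `simulate` is any callable that maps a parameter dict to a number.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        # Hypothetical ranges, chosen only for illustration.
        params = {
            "width": rng.uniform(10.0, 40.0),
            "length": rng.uniform(10.0, 40.0),
            "angle": rng.uniform(0.0, 360.0),
        }
        rows.append({**params, "total": simulate(params)})
    return rows


def placeholder_simulation(params):
    # NOT a radiation model: a dummy value (footprint area) standing in
    # for the result that the real simulation would return.
    return params["width"] * params["length"]


rows = generate_dataset(2000, placeholder_simulation)

# Write the rows as CSV (to an in-memory buffer here; a file works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["width", "length", "angle", "total"])
writer.writeheader()
writer.writerows(rows)

print(len(rows), "rows generated")
```

Passing the simulation in as a callable keeps the sampling and storage logic independent of the tool that produces the radiation values.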
Workflow used to generate the data from the simulations
Incident solar radiation
The incident solar radiation is the energy per unit area received from the sun and is measured in kilowatt-hours per square meter (kWh/m²). Solar radiation can be transformed into heat and electricity, among other forms of energy, which is why it is important to understand this information. According to Ladybug's description, the incident solar radiation is useful to evaluate the impact of a building's orientation on both energy use and the size/cost of the cooling systems. The incident solar radiation depends on the geographic location, time of day, season, local landscape, and local weather.
Exploratory data analysis
Exploratory Data Analysis (EDA) is used to understand the characteristics of a dataset and the relationships between its variables. There are different tools and techniques to perform EDA, but the most common is data visualization. During the exploration, it is possible to recognize the shape of the data and to discover patterns, outliers, and differences. This allows one to ask well-formulated research questions. The following images are part of the EDA of the generated synthetic data.
Number of occurrences
The distribution of the volume is positively skewed, which means it is asymmetric. The volume variable represents the building volume, which results from the randomly generated parameters Width, Length, XY, and ZU. The skew is positive because the tail of the curve is longer on the right side than on the left: most of the randomly generated buildings fall at the lower volumes, while a smaller number of larger buildings extend the tail to the right.
The distribution of the total incident solar radiation is also positively skewed, which suggests a possible correlation with the volume parameter.
The distribution of the angle is closer to a uniform distribution.
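The difference between a positively skewed and a roughly uniform distribution can be checked numerically with the sample skewness, which is positive for a long right tail and close to zero for a symmetric shape. The sketch below is a minimal illustration with made-up inputs: it assumes the volume is the product of three independent, uniformly sampled dimensions and that the angle is sampled uniformly. Both are assumptions for this example, not the actual parameterization of the study.

```python
import random


def sample_skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)

# Assumed for illustration: volume as a product of three uniform dimensions.
volumes = [
    rng.uniform(5, 30) * rng.uniform(5, 30) * rng.uniform(5, 30)
    for _ in range(2000)
]
# Assumed for illustration: angle sampled uniformly between 0 and 360 degrees.
angles = [rng.uniform(0, 360) for _ in range(2000)]

volume_skew = sample_skewness(volumes)
angle_skew = sample_skewness(angles)

print(f"volume skewness: {volume_skew:.2f}")  # clearly positive
print(f"angle skewness:  {angle_skew:.2f}")   # close to zero
```

Under these assumptions, multiplying uniformly distributed dimensions is enough to produce a right-skewed volume, even though each input on its own is uniform.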
Scatterplot: relationship between two sets of data
The large graph shows a clear positive linear correlation between the volume of the building and the total incident solar radiation. Generally, the bigger the volume, the bigger the total incident solar radiation.
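A linear trend like this one can be quantified by fitting a least-squares line and computing the coefficient of determination (R²). The sketch below uses a handful of invented (volume, total) pairs purely to show the calculation. The numbers are not taken from the dataset, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented toy values (arbitrary units), for illustration only.
volumes = [1.0, 2.0, 3.0, 4.0, 5.0]
totals = [2.2, 3.9, 6.1, 8.0, 9.8]

slope, intercept = fit_line(volumes, totals)
score = r_squared(volumes, totals, slope, intercept)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={score:.3f}")
```

A positive slope with an R² close to 1 corresponds to the tight, upward-sloping cloud of points seen in the large graph.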
The three small graphs on the top right show the relationships between Width and Total, Length and Total, and Angle and Total. The data points are scattered across these graphs with no visible trend, which indicates very low or no correlation.
The three small graphs on the bottom left show the relationships between Width and Volume, Length and Volume, and Angle and Volume. The first two graphs show a low positive correlation. The last graph has gaps in values.
Correlation matrix
The correlation matrix shows the correlation between all possible pairs of variables. The matrix is symmetric about its diagonal. It is useful to summarize data, identify patterns, and see which variables are most correlated with each other. Every cell displays the correlation coefficient, which ranges from -1 to 1: values near 1 indicate a strong positive correlation, values near -1 indicate a strong negative correlation, and 0 means no linear correlation.
The matrix above shows strong correlations between the following pairs of variables: volume and total (observed before), ZU and total, and YX and total. These are marked with a black rounded rectangle. Other correlations include width and volume, and length and volume.
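For readers who want to see how such a matrix is built, the following is a minimal sketch that computes the Pearson correlation coefficient for every pair of columns. The column values are invented toy numbers, not the study's data, and the function names are hypothetical. Tools such as pandas provide this calculation directly.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict with the coefficient for every pair of column names."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


# Invented toy columns, for illustration only.
columns = {
    "volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "total": [2.2, 3.9, 6.1, 8.0, 9.8],
    "angle": [90.0, 10.0, 300.0, 10.0, 90.0],
}

matrix = correlation_matrix(columns)

for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The diagonal is always 1 because each variable is perfectly correlated with itself, and the matrix is symmetric because the coefficient does not depend on the order of the two variables.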
Conclusions
This article shows the analysis of synthetic data generated with Rhino, Ladybug, and Laga. The generated data will be used to create a supervised model that predicts the total incident solar radiation from the volume and the other variables used to generate the building shape. This article focuses on the data rather than on architectural design or urban context.
The image below shows the geometry test bed used for the different simulations. The context is shown in wireframe. One of the geometries generated inside Rhino, shown in color, is simulated with Ladybug. The data was collected and saved using the Laga library.
REFERENCES
Gov.UK Review techniques to create synthetic datasets