Synthetic data with Rhino-Grasshopper, Ladybug and Laga library

Synthetic data is data that is artificially created, in contrast to data collected from real-world events. The goal is to provide a fast and cheap alternative to real-world data for building accurate artificial intelligence models. For this study, an incident solar radiation simulation is executed automatically over different building design options to generate 2,000 data points to train a supervised model. The objective is to estimate the incident solar radiation of a building without simulation software.

Synthetic data is a good alternative when privacy matters, when the required data does not exist, or when collecting it is expensive. It is relevant in this case because the data types can be adjusted, added to or removed from the dataset until an incident solar radiation prediction accuracy of 85% or more is achieved.

Rhinoceros and Grasshopper are used for the parametric model, Ladybug is used to simulate the incident solar radiation, and the Laga library generates the geometry and collects the information to create the data.

Workflow used to generate the data from the simulations
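The first step of the workflow, drawing random design options, could be sketched as follows. This is only an illustration under stated assumptions, not the Laga or Grasshopper implementation: the parameter ranges are hypothetical, and because the XY and ZU parameters are not defined in this article, a single `height` value stands in for them. The simulation step itself (Ladybug) is not shown.

```python
import csv
import io
import random

# Hypothetical parameter ranges (metres and degrees); the article does not state them.
PARAM_RANGES = {
    "width": (5.0, 30.0),
    "length": (5.0, 30.0),
    "height": (3.0, 60.0),  # assumption: stand-in for the article's XY and ZU parameters
    "angle": (0.0, 360.0),
}


def sample_design_options(n, seed=0):
    """Draw n random building design options, one dict per option."""
    rng = random.Random(seed)  # fixed seed so the dataset can be regenerated
    return [
        {name: round(rng.uniform(lo, hi), 2) for name, (lo, hi) in PARAM_RANGES.items()}
        for _ in range(n)
    ]


def to_csv(rows):
    """Serialise the rows to CSV text; simulation results would be added as a further column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


options = sample_design_options(2000)
csv_text = to_csv(options)
```

Each row would then be passed to the simulation, and the resulting total radiation stored next to the parameters that produced it.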


Incident solar radiation is the energy per unit area received from the sun and is measured in kilowatt-hours per square meter (kWh/m2). Solar radiation can be transformed into heat and electricity, among other forms of energy, hence the importance of understanding this information. According to the Ladybug description, incident solar radiation is useful for evaluating the impact of a building's orientation on both energy use and the size and cost of the cooling systems. It depends on the geographic location, time of day, season, local landscape and local weather.


Exploratory Data Analysis (EDA) is used to understand the characteristics of a dataset and the relationships between its variables. There are different tools and techniques to perform EDA, but the most common is data visualization. During the exploration it is possible to recognize the shape of the data and to discover patterns, outliers and differences. This makes it possible to ask well-formulated research questions. The following images are part of the EDA of the generated synthetic data.

Number of occurrences

The Distribution volume graph is positively skewed, which means the distribution is asymmetric. The volume variable represents the building volume, which results from the randomly generated parameters: Width, Length, XY and ZU. The skew is positive because the tail of the curve is longer on the right side than on the left. This means the right side covers a wider range of data points, in this case randomly generated buildings.

The Distribution total graph also shows a positive skew, which might indicate a correlation with the volume parameter.

The Distribution angle shows something closer to a uniform distribution.
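The shape of these distributions can also be checked numerically with a skewness coefficient. The sketch below is a toy illustration, not the article's dataset: the parameter ranges are hypothetical, and volume is assumed to be a simple product of three uniformly sampled dimensions. Under that assumption the product is right-skewed even though each input is uniform, while a uniformly sampled angle has a skewness close to zero.

```python
import random


def skewness(values):
    """Fisher-Pearson moment coefficient of skewness (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


rng = random.Random(42)
n_samples = 2000

# Hypothetical ranges; the article does not state how volume is derived.
widths = [rng.uniform(5, 30) for _ in range(n_samples)]
lengths = [rng.uniform(5, 30) for _ in range(n_samples)]
heights = [rng.uniform(5, 30) for _ in range(n_samples)]
angles = [rng.uniform(0, 360) for _ in range(n_samples)]

# Assumption: volume as a plain box, width * length * height.
volumes = [w * l * h for w, l, h in zip(widths, lengths, heights)]

volume_skew = skewness(volumes)  # positive: long tail on the right
angle_skew = skewness(angles)    # near zero: roughly symmetric
```

A positive coefficient corresponds to the longer right tail seen in the volume histogram, and a value near zero corresponds to a symmetric shape such as the uniform angle distribution.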

Scatterplot, relationship between two sets of data

The big graph shows a clear positive linear correlation between the volume of the building and the total incident solar radiation. Generally, the bigger the volume, the bigger the total incident solar radiation.

The three small graphs on the top right show the relations between Width and Total, Length and Total, and Angle and Total. The data points are spread out in these graphs. This means there is no trend in the data and thus very low or no correlation.

The three small graphs on the bottom left show the relations between Width and Volume, Length and Volume, and Angle and Volume. The first two graphs show a low positive correlation. The last graph has gaps in its values.

Correlation Matrix

The correlation matrix shows the correlation between all possible pairs of variables. The graphic is symmetrical about its diagonal. The matrix is useful to summarize data, identify patterns and understand which variables are most correlated with each other. Every cell displays the correlation coefficient, which ranges from -1 to 1. A coefficient of 0 means no linear correlation.

The matrix above shows strong correlations between the variables volume and total (observed before), ZU and total, and YX and total. These are marked with a black rounded rectangle. Other correlations are width and volume, and length and volume.
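A correlation matrix of this kind can be computed from the Pearson coefficient of every pair of columns. The following is a minimal sketch on toy values, not the article's dataset: the ranges, the fixed height of 10 and the linear relation between volume and total are all assumptions made only to produce something to correlate.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


rng = random.Random(1)
n_rows = 1000

# Toy columns (hypothetical ranges and relations, for illustration only).
width = [rng.uniform(5, 30) for _ in range(n_rows)]
length = [rng.uniform(5, 30) for _ in range(n_rows)]
angle = [rng.uniform(0, 360) for _ in range(n_rows)]
volume = [w * l * 10 for w, l in zip(width, length)]   # assumed fixed height of 10
total = [0.8 * v + rng.gauss(0, 200) for v in volume]  # assumed linear relation plus noise

matrix = correlation_matrix(
    {"width": width, "length": length, "angle": angle, "volume": volume, "total": total}
)
```

The result has ones on the diagonal and mirrored values on either side of it, which is the symmetry visible in the graphic.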


The article shows the analysis of synthetic data generated with Rhino, Ladybug and Laga. The generated data will be used to create a supervised model that predicts the total incident solar radiation from the volume and the other variables used to generate the building shape. The article focuses on data rather than on architectural design or urban context.
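As a rough idea of what such a supervised model could look like, the sketch below fits a one-variable least-squares line on a training split and scores it on held-out data. Everything here is an assumption for illustration: the data are toy values, the model is a plain linear fit on volume only, and the 85% target is read as a coefficient of determination (R²) of at least 0.85, which the article does not specify.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


rng = random.Random(7)

# Toy data (hypothetical): total radiation grows linearly with volume, plus noise.
volumes = [rng.uniform(250, 9000) for _ in range(2000)]
totals = [0.8 * v + rng.gauss(0, 300) for v in volumes]

# 80/20 split; the rows are already in random order, so no shuffle is needed here.
split = int(0.8 * len(volumes))
slope, intercept = fit_line(volumes[:split], totals[:split])

test_predictions = [slope * v + intercept for v in volumes[split:]]
score = r_squared(totals[split:], test_predictions)
meets_target = score >= 0.85  # assumed reading of the 85% accuracy target
```

Scoring on rows the model has not seen is what makes the accuracy figure meaningful; adding or removing input variables, as described earlier, would then be judged by how this held-out score changes.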

The image below shows the geometry test bed used for the different simulations. The context is shown in wireframe. One of the geometries generated inside Rhino and simulated with Ladybug is shown in colour. The data was collected and saved using the Laga library.

Context


What is synthetic data?

Gov.UK Review techniques to create synthetic datasets

Synthetic Data Statistics: Benefits, Vendors, Market Size


