Optimizing weather data storage for Machine Learning with Ladybug

Over the weekend I set up a script to scrape the entire set of EPW files that exist for all weather stations on earth. Much of the script was derived from the “hacking” we did during the AEC Hackathon. I ended up with 4800+ EPW files and around 8 GB of data. My aim is to use this dataset, along with the Ladybug API, while I work through a couple of Machine Learning texts.
However, right now, parsing and traversing the entire (or even a partial) dataset is an issue because of its size. So, my question is: has anyone tried storing or serializing EPW data in a relational database of some sort?
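For what a relational layout might look like, here is a minimal sketch using Python's built-in `sqlite3`: one table of stations, one table of hourly records keyed by station ID. The schema, station IDs, and values are invented for illustration, not taken from any actual EPW.

```python
import sqlite3

# One possible relational layout for EPW-derived data: a stations table plus
# an hourly table keyed by station id. All ids and values below are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stations (
        station_id TEXT PRIMARY KEY,
        city       TEXT,
        latitude   REAL,
        longitude  REAL
    );
    CREATE TABLE hourly (
        station_id   TEXT REFERENCES stations(station_id),
        hour_of_year INTEGER,   -- 0..8759
        dry_bulb_c   REAL,
        rel_humidity REAL
    );
    CREATE INDEX idx_hourly_station ON hourly(station_id);
""")
conn.execute("INSERT INTO stations VALUES (?, ?, ?, ?)",
             ("724940", "San Francisco", 37.62, -122.40))
conn.executemany("INSERT INTO hourly VALUES (?, ?, ?, ?)",
                 [("724940", h, 12.0 + h % 10, 70.0) for h in range(24)])

# A typical query: all hours above a temperature threshold for one station.
rows = conn.execute("""
    SELECT hour_of_year, dry_bulb_c FROM hourly
    WHERE station_id = ? AND dry_bulb_c > ?
""", ("724940", 18.0)).fetchall()
print(len(rows))  # 6 of the 24 synthetic hours exceed 18 °C
```

The advantage over re-parsing raw EPW text is that queries like the one above touch only the indexed rows they need, rather than loading whole files.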

(PS @mostapha, @TheodorosGalanos : Is anyone using Ladybug API for Machine Learning at all?)


@sarith ,

I know @mostapha and I have scripts that do the same EPW scrape. We’ve both written scripts to pull out relevant pieces of data from each EPW. For example, @mostapha uses it to generate the outdoor comfort info that displays in epwmap. I’ve used it to build a lookup of heating design day temperatures for the winter thermal comfort tool we developed at my office.

I haven’t created a database, though, and I don’t think @mostapha has either.

Actually, Mostapha and I had a discussion about this off-forum. It takes around 2 minutes to instantiate the entire 8 GB dataset as EPW classes in Ladybug. This is a one-time process, however: after that, one can rely on Python’s OOP structure for data retrieval.

Wow, that’s great to hear! OOP is quite a beautiful thing :slight_smile:

@sarith, I’m curious: once you create your EPW classes, can you pickle the objects so you don’t have to do the 2-minute data -> EPW obj conversion every time?


@SaeranVasanthakumar Pickling can be done, but it is a pain to unpickle (load) classes that are not part of Python’s standard library (in this case, EPW). Relational databases are probably a better idea. EPWs can also easily be stored in hash tables, since the StationID is unique for the majority of EPWs.
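To illustrate the hash-table idea, here is a sketch that keys a lookup by station ID and round-trips it through `pickle`. The `WeatherRecord` class is a toy stand-in, not Ladybug's EPW class; the unpickling friction mentioned above arises because pickle requires the original class to be importable under the same module path at load time, which can be awkward for third-party classes.

```python
import pickle

# Toy stand-in for a parsed weather file object. Ladybug's real EPW class is
# far richer; pickle only reloads it cleanly if that class is importable
# under the same module path when you call loads().
class WeatherRecord:
    def __init__(self, station_id, dry_bulb_c):
        self.station_id = station_id
        self.dry_bulb_c = dry_bulb_c

# Hash table keyed by station id, as suggested above (ids/values are made up).
by_station = {
    "724940": WeatherRecord("724940", [12.0, 13.5, 15.2]),
    "725300": WeatherRecord("725300", [-2.0, 0.5, 3.1]),
}

blob = pickle.dumps(by_station)   # serialize the whole lookup in one shot
restored = pickle.loads(blob)     # works here because the class is in scope
print(restored["725300"].dry_bulb_c[0])  # -2.0
```

The dict gives O(1) retrieval by StationID once the one-time parse is paid, which is essentially the OOP-caching approach discussed earlier.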


@sarith that’s true, I’ve also had some trouble pickling/unpickling complex objects. And a relational database will most likely give you the most flexibility and efficiency.

Hi @sarith

I apologize for such a late reply! This is fascinating work; I would love to have access to that dataset if you are willing to share it. I would be especially interested in a database that relates specific geographical locations to environmental conditions. Let me know if you have achieved something like this. If not, I wouldn’t mind working on it, as it could pair well with an urban research project I’m currently involved in. Let me know if that sounds interesting to you.

Concerning ML and Ladybug Tools, I have done quite a few studies with that setup, and I think it works quite well. In my case, the whole process is decoupled: I do the parametric work with LBT and GH, along with a bit of Python preprocessing, and handle the ML and DL work outside GH. I’d be glad to compare experiences on that if you want.
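The decoupled handoff described above can be as simple as a flat CSV written from the parametric side and read back outside Grasshopper. Here is a sketch using only the standard library; the column names and values are invented for illustration and stand in for whatever features a given study actually exports.

```python
import csv
import io

# Stand-in for the GH/LBT side: parametric results written as a flat CSV.
# Column names and numbers are hypothetical, purely to show the handoff.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["window_ratio", "orientation_deg", "annual_kwh"])
for row in [(0.2, 0, 4100.0), (0.4, 90, 5250.0), (0.6, 180, 6600.0)]:
    writer.writerow(row)

# Stand-in for the ML side: read the CSV back into (feature, target) pairs,
# ready to feed whatever training pipeline runs outside Grasshopper.
buf.seek(0)
reader = csv.DictReader(buf)
samples = [(float(r["window_ratio"]), float(r["annual_kwh"]))
           for r in reader]
print(samples[0])  # (0.2, 4100.0)
```

Keeping the interface to a plain file like this is what makes the two halves of the workflow independent: either side can be rerun or swapped out without touching the other.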

Kind regards,