Looking for Beta Testers for Weather Data Service

josephyang · August 2, 2020, 7:27am

Hello everyone,

For the last year or so, I’ve been working on making historical weather reanalysis data (e.g. ERA5 Reanalysis) more accessible. Despite the comprehensiveness and the quality of the data, it’s not widely used due to the effort required to process the data.

I’ve recently launched a service to access this data more easily and am looking for beta testers. You can sign up at https://oikolab.com by creating an account and adding a ‘Starter’ subscription, which gives access to historical reanalysis data for any place from 1980 onwards.

This enables one to quickly look at climate trends and develop tools such as:

Climate Explorer - how has climate changed where you live in the last 40 years?
EPW generator w/ Ladybug Python library - Chris created this Python script to demonstrate how you can generate an EPW file for any location by just specifying the location data and the year and then accessing the data. I modified the script a little for readability.

Please note that once fully launched, this will be a paid service, although the price will be substantially cheaper than currently existing services. I think it’s particularly concerning that as climate change progresses, it’s becoming difficult to even define a ‘typical meteorological year’ based on past history, so the idea is to enable energy modellers to easily generate 30-40 years’ worth of AMY files as required for any location. For instance, the ‘Starter’ subscription, which you can sign up for free at the moment, would allow you to generate up to 100 AMY files per month.

I’m particularly interested in feedback with regard to the use case (how often do you use historical weather data or need to generate EPW files?). If you have any questions, comments or suggestions, please feel free to reach out to me at joseph.yang@oikolab.com.

Thanks!

Joseph

seghier · August 2, 2020, 5:45pm

Thanks
File "C:/Users/user/oiko_make_epw.py", line 57, in oiko_make_epw
location.time_zone = attributes['utc_offset']
NameError: name 'location' is not defined

What is the problem? The same thing happens for
'lat': location.latitude,
'lon': location.longitude,

The problem seems to be that it must be Location with an uppercase letter: Location.latitude …

Another problem:

C:\Python36\python.exe C:/Users/user/oiko_make_epw.py
<html>
  <head>
    <title>Internal Server Error</title>
  </head>
  <body>
    <h1><p>Internal Server Error</p></h1>
    
  </body>
</html>

Process finished with exit code 0

josephyang · August 2, 2020, 11:26pm

Thanks for finding the error in the script - there was a missing line to define the ‘location’ parameter (my error, not Chris’). The script has been corrected and you should be able to try it again.

Joseph

SaeranVasanthakumar · August 12, 2020, 9:45am

Cool, I’m excited to see you finally release this @josephyang!

Can you discuss the spatial resolution of this dataset a bit more (or point me towards some ERA5 resources)? I’m used to obtaining my weather data from airport sites, so I’m struggling to understand weather data interpolated from a 0.25 x 0.25 deg grid for any location.

Specifically:

How was the 0.25 x 0.25 gridded data obtained?
How reliable is the interpolated data in between the grid points? Does it come with confidence intervals to indicate the uncertainty of the interpolated data?

And on a technical note, I’m curious about what you used to develop the visualizations on the Climate Explorer website. Is that with a JavaScript library (e.g. D3) or with one of those Python-to-JavaScript frameworks (e.g. Dash)?

S

josephyang · August 12, 2020, 12:07pm

Thanks @SaeranVasanthakumar! (and great questions!)

Here’s a short (3 min) video on the ERA5 reanalysis. ‘Interpolation’ in this sense is ‘spatial (or bilinear) interpolation’ rather than ‘fill in the missing data’ type - so the value for any given point is derived from its adjacent four corner values.
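
To make the ‘adjacent four corner values’ idea concrete, here is a minimal sketch of bilinear interpolation on a regular 0.25 degree grid. The `grid_value` callable and the toy temperature field are hypothetical stand-ins for the gridded data, and the sketch ignores longitude wrap-around and the poles, so it is not how any particular service implements this.

```python
import math


def bilinear_interpolate(lat, lon, grid_value, step=0.25):
    """Interpolate a value at (lat, lon) from its four surrounding grid corners.

    grid_value is a hypothetical callable returning the gridded value at a
    grid node (lat, lon); step is the grid spacing in degrees.
    """
    # Lower-left corner of the grid cell that contains the point
    lat0 = math.floor(lat / step) * step
    lon0 = math.floor(lon / step) * step
    lat1, lon1 = lat0 + step, lon0 + step

    # Fractional position of the point inside the cell (0 to 1)
    ty = (lat - lat0) / step
    tx = (lon - lon0) / step

    # Interpolate along longitude on both latitude edges, then along latitude
    south = (1 - tx) * grid_value(lat0, lon0) + tx * grid_value(lat0, lon1)
    north = (1 - tx) * grid_value(lat1, lon0) + tx * grid_value(lat1, lon1)
    return (1 - ty) * south + ty * north


def toy_temperature(lat, lon):
    """Made-up field that is linear in lat and lon, so the result is exact."""
    return 20.0 + 0.5 * lat - 0.1 * lon


value = bilinear_interpolate(43.6529, -79.3849, toy_temperature)
```

Because the toy field is linear, the interpolated value matches the field exactly; for real weather variables the result is only as good as the assumption that the field varies smoothly inside the cell.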

To answer your questions:

The 0.25 x 0.25 gridded data is the format provided by ECMWF (in NetCDF or GRIB). If you imagine the atmosphere broken up into a 3D grid, that should give you an idea. The weather model doesn’t quite use this grid, as its own spacing is not uniform - this lat x lon grid is generated from the Gaussian grids of various resolutions used by the atmospheric physics model. You can also get the data directly but it can be very slow to process.
ERA5 reanalysis data should be much better than airport data because it assimilates other types of observations (satellite, aircraft) to create a more consistent state estimation of the atmosphere. Because of the temporal & spatial resolution, it’s not quite suitable for abrupt event analysis such as storms and can be less accurate in mountainous regions. A higher resolution version (0.1 x 0.1) does slightly better for uneven terrain (I’ll have this later in the summer) but most EnergyPlus users shouldn’t notice much difference. Confidence intervals are available but it’s not something that I’ve processed for the API.

The visualization was done all in Python using Dash. It’s quite nice to work with although I think it’s a bit slower than Bokeh.

Hope this helps and let me know if you have more questions!
Joseph

josephyang · October 20, 2020, 1:19am

This is a note to announce that the beta-testing phase has ended for OikoLab.

Thanks to everyone who tried it out. For those who might still be interested, a ‘Starter’ plan with up to 300 API calls/month will remain free to try. Along with the new web-tool from Lukas (https://rokka.shinyapps.io/shinyweatherdata) and LB scripts to process NOAA files, it’s great to see that there are growing options for generating AMY files.

Thanks again!

SaeranVasanthakumar · November 13, 2020, 5:18am

@josephyang

Sorry for my late response here, I haven’t had the headspace to revisit this discussion until now - but thank you for your detailed response!

This sounds great, I really love the idea of being able to use multiple samples of spatial data to ensure a more robust representation of weather conditions.

Although, if being used in an energy simulation like you suggest here:

ERA5 reanalysis data should be much better than airport data because it assimilates other types of observations (satellite, aircraft) to create a more consistent state estimation of the atmosphere. Because of the temporal & spatial resolution, it’s not quite suitable for abrupt event analysis such as storms and can be less accurate in mountainous regions. A higher resolution version (0.1 x 0.1) does slightly better for uneven terrain (I’ll have this later in the summer) but most EnergyPlus users shouldn’t notice much difference.

I believe you would have to first apply the TMY weighted averaging method to 15 - 30 years of ERA5 data to ensure you’re getting an appropriately sampled year of weather over time.
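
As a rough illustration of what that weighting involves, here is a simplified sketch in the spirit of the Sandia-style TMY selection: for each calendar month, score every candidate year by a weighted sum of Finkelstein-Schafer (FS) statistics and keep the year closest to the long-term distribution. The data layout, the function names and the single-variable toy record are assumptions for illustration; the full procedure uses several variables with prescribed weights, plus persistence checks and smoothing at month boundaries, all of which are omitted here.

```python
from bisect import bisect_right


def empirical_cdf(sorted_sample, x):
    """Fraction of sample values less than or equal to x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def fs_statistic(candidate, long_term):
    """Mean absolute difference between the candidate month's CDF and the
    long-term CDF, evaluated at the candidate's daily values."""
    cand_sorted = sorted(candidate)
    long_sorted = sorted(long_term)
    diffs = [abs(empirical_cdf(cand_sorted, x) - empirical_cdf(long_sorted, x))
             for x in cand_sorted]
    return sum(diffs) / len(diffs)


def pick_typical_years(daily, weights):
    """Return {month: year} with the lowest weighted FS score per month.

    daily[variable][(year, month)] -> list of daily values (hypothetical layout)
    weights[variable] -> relative weight of that variable
    """
    first_var = next(iter(daily))
    months = sorted({m for (_, m) in daily[first_var]})
    typical = {}
    for month in months:
        years = sorted(y for (y, m) in daily[first_var] if m == month)
        scores = {}
        for year in years:
            score = 0.0
            for var, weight in weights.items():
                # Pool this calendar month across all years for the long-term CDF
                long_term = [v for (y, m), vals in daily[var].items()
                             if m == month for v in vals]
                score += weight * fs_statistic(daily[var][(year, month)], long_term)
            scores[year] = score
        typical[month] = min(scores, key=scores.get)
    return typical


# Made-up three-year record for one month: a cold, a middle and a warm January
daily = {
    "temperature": {
        (2018, 1): [-8.0, -7.0, -9.0, -8.5],
        (2019, 1): [-3.0, -2.0, -4.0, -3.5],
        (2020, 1): [2.0, 3.0, 1.0, 2.5],
    }
}
typical = pick_typical_years(daily, {"temperature": 1.0})
```

With this toy record the middle year (2019) is selected for January, since its distribution sits closest to the pooled long-term one.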

If that can be done, I think ERA5 seems like a great solution for energy simulation, especially for sites that are outside or near the limits of the 30-50 km distance or 100 m elevation range from airport sites that EnergyPlus recommends[1]. Even without that, it’s still a better way to assess local conditions than using the sparse datasets from WeatherUnderground, which is how I’ve tried to assess local conditions in the past.

Finally regarding uncertainty quantification:

Confidence intervals are available but it’s not something that I’ve processed for the API.

Personally, I think this would be really useful, since the biggest concern in going from a real but single-source dataset (airport data TMY) to a bilinearly interpolated dataset is trusting the inherent uncertainty in the latter’s approximation. You’ve already mentioned that it’s less accurate in mountainous regions, so we know the uncertainty isn’t uniform over the entire grid, and it’d be much better to have a way to quantifiably check this, rather than rely on rules of thumb. (More broadly, I just think the building science discipline would benefit from better acknowledging the inherent stochasticity/uncertainty in measurements and the way that is propagated forward in our simulations.)

[1] https://energyplus.net/weather/simulation

josephyang · November 14, 2020, 1:46am

Hi @SaeranVasanthakumar, no worries and thanks for the reply.

It’s possible to get up to 40 years of data in one go so it’s fairly straightforward to create an updated TMY file if needed. I’ve started processing uncertainty values although one needs to really understand what it means in this context.

On the other hand, to consider airport data as ‘real’ and reanalysis data as ‘simulated’ or ‘approximate’ is a bit of a mischaracterization I think. One way to think of it is to consider the room temperature as reported by your thermostat vs. another that takes in the thermostat reported temperature, plus a thermal image of the room, the location of the thermostat in the room in relation to the window, and the laws of thermodynamics etc. There is much more math involved in the latter processing but it doesn’t make it any more unreal. NREL also recommends using gridded data over point-source data, probably because deriving radiation parameters from airport-reported sky coverage is really quite problematic (as I discussed here).

But in any case, if the models are not calibrated, it really just comes down to a matter of convenience to use the weather data that’s sufficiently correct to drive design decisions. As I like to quote, ‘all models are wrong, but some are useful.’

SaeranVasanthakumar · November 14, 2020, 3:34am

Thanks @josephyang for going into the subtleties of reanalysis data. Just to make sure I understand this completely, when you say the following:

On the other hand, to consider airport data as ‘real’ and reanalysis data as ‘simulated’ or ‘approximate’ is a bit of a mischaracterization I think. One way to think of it is to consider the room temperature as reported by your thermostat vs. another that takes in the thermostat reported temperature, plus a thermal image of the room, the location of the thermostat in the room in relation to the window, and the laws of thermodynamics etc. There is much more math involved in the latter processing but it doesn’t make it any more unreal. NREL also recommends using gridded data over point-source data, probably because deriving radiation parameters from airport-reported sky coverage is really quite problematic (as I discussed here).

Is it also a mischaracterization to think of the bilinearly interpolated data between the gridded data as an ‘approximation’? I’m thinking that the bilinear interpolation process between grid points, in a literal sense, is just the approximation of intermediate values based on some arbitrary function, and can give different results if it’s linear or some wacky high order polynomial:

We can see the bilinear analog here, with more smoothing for higher order polynomials:

That’s what I’m thinking of as the approximated value, since it’s essentially a line of best fit - not actual data. And unlike the real data source (e.g. airports), the uncertainty associated with this dataset comes not just from the measurement, but also from which function is chosen and how well it represents weather at that given location.

On the other hand, I can see how my reasoning can be wrong if there is in fact actual, continuous data available (i.e your thermal image example) that is used to fill in these gaps, and thus it’s not a naive or crude interpolation as I think it is.

josephyang · November 14, 2020, 6:17am

It’s actually much more sophisticated than that and it’s quite different from a line of best fit. Imagine the atmosphere as a fluid, with continually varying parameters across 3D space and time. Atmospheric physics dictates the limits of change (i.e. gradient) and the limits of the values along any of these axes, so the measurement/sampling noise can be filtered out.

For example in the figure below, which way would you say the wind is blowing?

So it’s not so much a ‘fill in the gap with airport sensor data as ground truth’ but more of starting with the laws of physics and treating all measurements as just that - measurements.

Perhaps as another example, imagine 3D scanning a sphere - we can generate an arbitrary number of data points at arbitrary resolutions across the x, y, and z dimensions, but knowing beforehand that it’s a sphere is really the most important piece of knowledge we have about that object. Similar to how x^2 + y^2 + z^2 = r^2 is the mathematical representation of our understanding of a sphere, our understanding of atmospheric processes is captured in numerical weather models, which are also used to generate these data with the benefit of hindsight.
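
The sphere analogy can be sketched in a few lines. This is only a toy: the radius, the noise level and the sample count are made up, and the sphere is assumed to be centred at the origin. The point it shows is that once the model (a sphere) is known, every noisy scan point becomes a measurement of the same single parameter r, and unsampled locations follow from the model rather than from interpolating between neighbours.

```python
import math
import random

random.seed(0)
TRUE_RADIUS = 2.0  # made-up 'ground truth' for the toy example


def noisy_sphere_point(noise=0.05):
    """One noisy scan sample on the upper hemisphere of the sphere."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    phi = random.uniform(0.0, 0.5 * math.pi)
    r = TRUE_RADIUS + random.gauss(0.0, noise)
    return (r * math.sin(phi) * math.cos(theta),
            r * math.sin(phi) * math.sin(theta),
            r * math.cos(phi))


samples = [noisy_sphere_point() for _ in range(200)]

# Model knowledge: x^2 + y^2 + z^2 = r^2, so each sample is a noisy estimate of r
radius_estimate = sum(math.sqrt(x * x + y * y + z * z)
                      for x, y, z in samples) / len(samples)


def height_at(x, y, radius):
    """Surface height z at (x, y) on the upper hemisphere, from the model."""
    return math.sqrt(radius ** 2 - x ** 2 - y ** 2)


# Height at a location that was never scanned, derived from the fitted model
z_estimate = height_at(0.3, 0.4, radius_estimate)
```

Averaging 200 noisy samples pins down r far more tightly than any single scan point could, which is the sense in which the model filters out sampling noise.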

SaeranVasanthakumar · November 14, 2020, 8:34am

Got it! This quote in particular clears up the high-level intuition for me:

So it’s not so much a ‘fill in the gap with airport sensor data as ground truth’ but more of starting with the laws of physics and treating all measurements as just that - measurements.

josephyang · November 14, 2020, 9:43am

Great to hear it helped to clarify things a little!

josephyang · December 4, 2020, 3:48am

It seems like the energy modelling community in general isn’t too familiar with meteorological methods, so here’s a simple overview to browse through - https://www.ecmwf.int/assets/elearning/da/da1/story_html5.html.

This explains quite well how we can do much better than equating airport observations with ground truth, using methods such as the Ensemble Kalman Filter and Data Assimilation.
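
For intuition only, the core idea can be shown with a single-variable Kalman update, which blends a model forecast and an observation according to their error variances. The numbers below are hypothetical, and operational assimilation systems are far more elaborate (many variables, 3D space and time), so treat this as a sketch of the principle rather than of any real system.

```python
def kalman_update(forecast, forecast_var, observation, observation_var):
    """Blend a forecast and an observation, weighting each by its uncertainty."""
    # The gain is near 1 when the forecast is uncertain, near 0 when the observation is
    gain = forecast_var / (forecast_var + observation_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


# Hypothetical numbers: the model says 12.0 C (variance 1.0),
# a station reads 14.0 C (variance 4.0)
analysis, analysis_var = kalman_update(12.0, 1.0, 14.0, 4.0)
```

Here the gain is 0.2, so the analysis moves only a fifth of the way from the forecast towards the noisier observation, and the resulting variance is lower than that of either input. The observation is treated as one more piece of evidence, not as ground truth.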

LelandCurtis · December 28, 2020, 9:12pm

@josephyang, if I understand the conversation correctly, you are suggesting we replace “measured” TMY data from local weather stations with “modeled” AMY data derived from global models?

I’ve noticed consumer weather apps like DarkSky have improved dramatically over the last few years, which makes me think the underlying weather models have improved dramatically as well. Are we at a place where these models are better for microclimate data than taking nearby measured data? I assume models will be better for locations where local interference from geographic features like bodies of water, mountains, etc. is strong, but that locations similar to the weather stations would be better served by traditional measured TMY data.

Are you aware of any research discussing this topic?

josephyang · December 29, 2020, 7:12am

Hi @LelandCurtis, thanks for the question.

I think one way to look at it is that they’re actually both mostly modeled - so rather than ‘measured’ (ground truth) vs. ‘modeled’ (synthetic), it’s ‘sampled or empirically modeled’ vs. ‘best state estimation in terms of meteorological science’. For instance, if we take solar/thermal radiation parameters, the cloud coverage values used to calculate these have been somewhat repurposed from their original intent (e.g. the cloud ceilings reported are not the same as total cloud coverage), whereas NWP models even take into account things like aerosols and various atmospheric absorption spectra. The former is mostly a regression model and the latter a physics-based one that also removes spatial and temporal sampling noise.

In the field of renewable energy modelling & forecasting (e.g. solar & wind), I think it would be hard to find anyone who uses weather station data as a source, and NREL also recommends using gridded data (their NSRDB is based on MERRA2 reanalysis data) over station data. The underlying NWP models are indeed getting better, but apps such as DarkSky don’t run a physics-based model and instead run ML-derived enhancements over other NWP models (e.g. GFS) using public radar datasets. Projects such as NASA Power have also been created in recognition of the limitations of station-based data.

I do also recommend reanalysis data over station-based data in terms of data integrity and physics-consistency, especially if any sort of calibration is being done, but this is not to say that existing TMY files are grossly incorrect - for most use cases, I think it’s fine either way. I’ve been discussing with Dru and Linda the possibility of updating some of the TMY data with a reanalysis-derived dataset, but I’m not too sure what other efforts there are.

Hope this helps!

josephyang · February 23, 2021, 8:45am

An update on this for those who may be interested: OikoLab is collaborating with Climate One Building this year on a major update of their weather files using ERA5 parameters.

This addresses one of the weaknesses of cloud-cover-derived radiation values, as discussed in this thread and in the discussions above, and uses the latest weather data (up to the end of 2020) to generate TMY files. We’re hoping that this will provide more consistency and transparency, especially for locations outside of North America, and allow building-energy simulations to be performed using updated weather files that reflect the changing climate.

Thanks, and please feel free to reach out with any questions!

seghier · November 1, 2021, 1:02am

Hello @josephyang
Today I tried the script to create an EPW but it didn’t work.
First I got this error:

    location.time_zone = attributes['utc_offset']
KeyError: 'utc_offset'

I set location.time_zone = 1.0
but then I got another error:

File "C:\Python36\lib\site-packages\ladybug\skymodel.py", line 291, in estimate_illuminance_from_irradiance
    w = math.exp(0.08 * dew_point - 0.075)
TypeError: can't multiply sequence by non-int of type 'float'

josephyang · November 1, 2021, 6:21am

Hi @seghier - the API now has the ability to query multiple locations and the data return format has been updated slightly since the initial release. Please see - OikoWeather API Reference

I can’t comment on the skymodel error as that is part of the Ladybug code.

seghier · November 1, 2021, 9:50am

Thanks @josephyang
I tried the example on the website, but how do I export the result as an EPW?

import requests
import json
import pandas as pd

r = requests.get('https://api.oikolab.com/weather',
                 params={'param': ['temperature','wind_speed'],
                         'start': '2010-01-01',
                         'end': '2018-12-31',
                         'lat': 43.6529, 
                         'lon': -79.3849,
                         'api-key': 'your_api_key_here'}
                 )

weather_data = json.loads(r.json()['data'])
df = pd.DataFrame(index=pd.to_datetime(weather_data['index'], 
                                       unit='s'),
                  data=weather_data['data'],
                  columns=weather_data['columns'])

df.to_csv(path_or_buf='data.csv')