Participation deadline: 10 August 2022

Summer school: wind power production forecasting

This challenge consists in forecasting the amount of wind power produced by a single farm over several 48-hour periods.

This challenge is over.

742 contributions · 103 participants
Welcome to the Data Challenge EEIA!

 

This article will allow you to take your first steps in the world of data science and give you all the tools to dive into the Wind Power Prediction data challenge. The objective is to predict the power produced by a wind farm over several 48-hour periods using forecasted wind characteristics, such as wind speed and direction.

We will give you the technical information needed to install your development environment and describe the main steps a data scientist follows to approach a problem. Several links are at your disposal to deepen the concepts related to this article. Moreover, a notebook allows you to develop your first model; it is then up to you to improve it!

I. Setting up the environment

To get started with the data challenge, you will have to choose a programming language. In data science, the best known are Python and R. You can use the one you prefer, but if you are new to this domain, we advise you to use Python.

To start coding, you will have to download a development environment. This will allow you to create scripts, use packages and build your machine learning models. Among these environments, we can cite Jupyter Notebook, PyCharm and Spyder for Python, and RStudio for R. To have access to all of these, we advise you to download Anaconda. This well-known open-source data science platform simplifies the management of environments and packages.
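
Once Anaconda is installed, a quick way to check that everything works is to run a short script importing the usual libraries. This is only a minimal sanity check, assuming a standard Anaconda-style environment with pandas, NumPy, Matplotlib and scikit-learn installed:

```python
# Minimal sanity check of the data science stack
# (assumes an Anaconda-style environment with the usual packages installed).
import sys

import matplotlib
import numpy as np
import pandas as pd
import sklearn

print("Python:", sys.version.split()[0])
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("matplotlib:", matplotlib.__version__)
print("scikit-learn:", sklearn.__version__)
```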


II. Exploratory Data Analysis

Now that you have set up your environment, you can use packages and write your scripts, and you are ready to (really) get started.

Different steps are necessary to get familiar with the dataset and provide successful submissions in a data challenge. First, you need to read and analyze the data: in most data challenges, the datasets are not clean and can contain many corrupted values or have missing rows, columns or samples. This first step is very important, as such issues can cause problems when you apply a machine learning model and will greatly impact your results.

To do these analyses, you can either check statistics describing each variable or visualize your data with some libraries. This will give you visual insights that can be easily interpreted and understood. Visualization is also very important for a data scientist, as it allows you to communicate your insights and results.
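
As an illustration, here is a minimal exploratory sketch with pandas and Matplotlib. The file name "train.csv", the ";" separator and the column names ("wp1", "ws") follow the data description further down this page but are assumptions; adapt them to the files provided on the platform:

```python
# A first look at the training data.
# File name, separator and column names are assumptions; adjust as needed.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv", sep=";")

print(train.head())        # first rows
print(train.describe())    # summary statistics for each variable
print(train.isna().sum())  # number of missing values per column

# Distribution of the target (normalized wind power)
train["wp1"].hist(bins=50)
plt.xlabel("wp1")
plt.show()

# Relation between wind speed and produced power
train.plot.scatter(x="ws", y="wp1", alpha=0.3)
plt.show()
```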

Data processing and transformation is often considered the most important part of a data challenge. Once you have analyzed the dataset thoroughly, you will have to take action: you can, for instance, correct the corrupted data, transform a column to add information, or create new columns that will be useful in your study. This step is also very important because it turns raw data into informative, clean data, improving your submissions and helping you climb the leaderboard.
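
As a hedged example of this step, here is a small sketch of possible transformations on the wind data. The column names ("date", "wd", "ws") are assumptions based on the data description below, and the features created here are only suggestions, not the method used in the official notebook:

```python
# Simple feature engineering on the wind data (illustrative only).
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv", sep=";")

# Calendar features, assuming a "date" column that pandas can parse
train["date"] = pd.to_datetime(train["date"])
train["hour"] = train["date"].dt.hour
train["month"] = train["date"].dt.month

# The wind direction is an angle: encode it cyclically so that 359° and 1° are close
train["wd_sin"] = np.sin(np.deg2rad(train["wd"]))
train["wd_cos"] = np.cos(np.deg2rad(train["wd"]))

# The kinetic energy of the wind grows roughly with the cube of the speed
train["ws_cubed"] = train["ws"] ** 3
```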

III. Modeling
A. Splitting the dataset

Now that we are done manipulating our data, we must build a strategy to learn the relations between the variables, and in particular how the wind power evolves with respect to its past values and the other variables.

Separating data into training and testing sets is important in evaluating machine learning models.


Why split the training dataset?

We split the training data into a training set and a testing set, which gives us the opportunity to fairly evaluate our model's performance without submitting results.

Typically, around 80% of the full training data is allocated to the training set and 20% to the testing set, which is held out for evaluation. In some cases, when we deal with models that have hyperparameters, we need to split the dataset into three parts: 60% for training, 20% for validation and 20% for testing. The training part allows our model to learn the combination of variables that best fits the data, and we use the validation set to evaluate the performance of our model and tune hyperparameters during training. The idea behind splitting the training data into training and validation sets is to avoid overfitting, meaning that our model becomes very good on the training data because it learns to fit those points exactly, but does poorly on other data; we say that it does not generalize well.

Finally, we use the test set to have a fair evaluation of our model that is independent of the training process.

To sum up, we have a first part, named the training set, which aims to learn the dynamics of the time series. A second part is used to evaluate our model, avoid overfitting and fine-tune the hyperparameters. Once we have used these two sets, we need another, independent one that will give us the model's accuracy. Cross-validation is an approach that allows us to make better use of our dataset and to evaluate the model more reliably.
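
Here is a minimal sketch of these ideas with scikit-learn. X and y are placeholders for your feature matrix and target; for time series it is usually safer to keep the chronological order when splitting:

```python
# Hold-out split and time-series cross-validation (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split, TimeSeriesSplit

X = np.random.rand(1000, 4)  # placeholder features, replace with your own
y = np.random.rand(1000)     # placeholder target (e.g. wp1)

# 80/20 hold-out split; shuffle=False preserves the chronological order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Time-series cross-validation: each fold trains on the past
# and validates on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    # fit a model on (X_tr, y_tr) and evaluate it on (X_val, y_val) here
```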

For more detail, here is a link.

B. Modeling

Now we can move on to building our machine learning model. There is a wide variety of models that can be applied to tackle this data challenge. In this article and in the accompanying Jupyter Notebook, we will give you some ideas about what can be done.

To deal with time series data, we can separate two main approaches:

  • The first approach tackles the problem without considering the time dependence. Each time step is predicted from its own characteristics, i.e. the values taken by the different variables we have. This is the approach developed in the notebook (a minimal sketch is also given after this list).

Any machine learning model can be used, from linear regression to deep learning. The advantage of this approach is that we can take advantage of the best models in machine learning. Its limit is that it does not consider temporal aspects such as trend and seasonality.

  • The second approach uses time series models, which try to learn the time series as a whole and take the temporal aspect into account. Many models exist; we can cite autoregressive models such as ARIMA, or exponential smoothing. Some packages are available to deal with time series; among them, we can cite Prophet. Here is a great resource for getting familiar with time-series data: Forecasting: Principles and Practice.
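
As announced above, here is a minimal sketch of the first, feature-based approach. The random forest is only one possible choice of regressor, and the arrays are placeholders to be replaced by your own features and target:

```python
# Feature-based approach: each time step is predicted from its own features,
# ignoring the temporal dependence. The random forest is one arbitrary choice.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)  # placeholder features (e.g. u, v, ws, wd)
y = np.random.rand(1000)     # placeholder target (wp1)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
print("Validation MAE:", mean_absolute_error(y_val, val_pred))
```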

It is possible to combine these two approaches to get the best of both worlds. How?

Ensemble models are very popular in Kaggle competitions; the idea is to combine different models into a stronger one, a kind of "super model", to predict better.
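
One very simple way to build such an ensemble is to average the predictions of several models, as in the sketch below. Here two scikit-learn regressors stand in for, say, a feature-based model and a time-series model; the equal weights are an arbitrary choice that can be tuned on the validation set:

```python
# A very simple ensemble: average the predictions of two different models.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)  # placeholder features
y = np.random.rand(1000)     # placeholder target

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)

model_a = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
model_b = LinearRegression().fit(X_train, y_train)

# Equal-weight average of the two predictions
ensemble_pred = 0.5 * model_a.predict(X_val) + 0.5 * model_b.predict(X_val)
print("Ensemble MAE:", mean_absolute_error(y_val, ensemble_pred))
```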

C. Quantifying the results

Quantifying the results is very important to fully understand whether the model is performing as expected. There are multiple ways to analyze the results quantitatively. In the following, we give you some metrics:

MAE - Mean Absolute Error: Mean absolute error is a measure of errors between paired observations expressing the same phenomenon.

RMSE - Root Mean Square Error: Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).

R2: R2 is the proportion of the variance in the dependent variable that is predictable from the independent variable.

Dickey-Fuller Test: This test is used to check whether the dataset is stationary or non-stationary (i.e. whether there is a trend or seasonal effects).

ACF/PACF - Autocorrelation Function / Partial Autocorrelation Function: The autocorrelation function gives the correlation of the process with itself at pairs of time points. Partial autocorrelations measure the linear dependence of one variable after removing the effect of other variable(s) that affect both variables.
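
These quantities are easy to compute in Python; here is a minimal sketch using scikit-learn and statsmodels (assumed to be installed), with placeholder arrays standing in for the observed and predicted wind power:

```python
# Computing the metrics above on placeholder data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

y_true = np.random.rand(500)                   # replace with the observed wind power
y_pred = y_true + 0.1 * np.random.randn(500)   # replace with your predictions

print("MAE :", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R2  :", r2_score(y_true, y_pred))

# Dickey-Fuller test on the series: a small p-value suggests stationarity
adf_stat, p_value, *_ = adfuller(y_true)
print("ADF p-value:", p_value)

# Autocorrelation and partial autocorrelation plots of the series
plot_acf(y_true, lags=48)
plot_pacf(y_true, lags=48)
plt.show()
```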


With these quantitative analyses, it is also very important to visualize the results to understand how the trained model performs on the test data. By plotting your results, you will be able to compare models and see where you can improve them.

Once we have found a good model for our dataset using the methods described above, we can start forecasting by feeding the model new dates and features that were not used or seen in the training dataset.

IV. Submission

Once you have made your predictions, it is time to submit your results! For this, part of the test set on the platform is reserved to evaluate your model and to give you an indication, through a metric, of the quality of your model and predictions.

Another part is private and is reserved for the final evaluation when the data challenge is over. The objective is to prevent participants from overfitting the public test set.

Real-life applications: In business projects, we combine the forecasted data with the training (actual) data to visualize performance. Data visualization is very important in business, as it is self-explanatory and can be presented and shared easily with end users. Tools such as Power BI or Tableau are very powerful and can help us visualize and analyze the forecast period.

Moreover, from a practical point of view, forecasting wind power production makes it possible to better match supply to demand and to integrate wind energy into the energy mix.

 

Data challenges are an opportunity to move from theory to practice and to learn a lot. Please take advantage of it, and don't forget to be creative!

The challenge rules can be consulted here:
https://bit.ly/3PoAqBP

Wind is a renewable energy source that is growing rapidly in Europe. However, its rapid development poses many challenges to the electricity market, and in particular to transmission system operators, who have to cope with the uncertainty of wind power production when making planning and dispatch decisions. In this context, short-term forecasting (up to 48 hours) is an essential tool for managing this variability and operating power systems with a large share of wind energy more efficiently.

In this challenge, you will be asked to forecast the amount of wind power produced by a single farm over several 48-hour periods. You do not have access to the farm's location. Data are available for periods ranging from the 1st hour of 01/07/2009 to the 12th hour of 28/06/2012.

The test phase consists of 156 periods of 48 hours between 01/01/2011 and 28/06/2012, for which you will have access to the forecasted wind characteristics and will have to predict the amount of wind power produced. Note that each of these periods is separated from the adjacent ones by a 36-hour window.

The model training phase consists of two parts:

  • The wind strength and the wind power produced from 01/07/2009 to 31/12/2010 (initial training period)
  • The wind strength and the wind power produced over the 36-hour windows between each of the 156 test periods, on which you can retrain your models.

Note that in this first phase we are quite flexible with the operational constraints: we allow you to use all the training data to predict any point of the test set. Note that in "real life" this would not be possible, since some points of the training data lie in the future relative to some points of the test data. We will forbid this in the second phase of the challenge in order to get closer to a real-world scenario.

To help you start developing your solution, you will find here a Jupyter Notebook implementing all the steps needed to submit a solution. Please note that you must submit a csv file with ";" as separator and "." for decimals, as shown in the demo notebook.
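
For instance, with pandas a submission respecting this format could be written as in the sketch below. The file names and the column layout ("date", "wp1") are only assumptions; follow the exact layout shown in the demo notebook:

```python
# Writing a submission file with ";" as separator and "." for decimals.
# File names and column names are assumptions; use the demo notebook's layout.
import numpy as np
import pandas as pd

test = pd.read_csv("test.csv", sep=";")   # assumed test file
predictions = np.zeros(len(test))         # replace with model.predict(...)

submission = pd.DataFrame({"date": test["date"], "wp1": predictions})
submission.to_csv("submission.csv", sep=";", decimal=".", index=False)
```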

The training set contains 18,756 records with 4 features and 1 target variable to predict. Each record is a time-stamped measurement of the wind power, together with 4 forecasted characteristics of the wind around the farm.

The test set, for which you need to submit your solution, contains 7,488 records with the same four input features but no target.

“wp1” is the target variable. It is the normalized power generated by one wind farm. “u”, “v”, “ws” and “wd” are forecasted characteristics of the wind around the farm:

  • “u” is the zonal wind component of the wind vector,
  • “v” is the meridional wind component of the wind vector,
  • “ws” is the wind speed, the L2 norm of the wind vector,
  • “wd” is the wind direction, the angle between the wind vector and the north.

Note that the components “u” and “v” are expressed with the wind vector azimuth convention. You can find more information here.
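
As a small worked example, the wind speed can be recovered from the two components as the L2 norm mentioned above; the direction formula below follows one common azimuth convention and should be checked against the reference linked above:

```python
# Recovering ws and wd from the components u and v.
# ws is the L2 norm of the wind vector; the direction formula is one common
# convention (angle from north, in degrees) and may need to be adapted.
import numpy as np

u, v = 2.0, 3.0  # example zonal and meridional components

ws = np.hypot(u, v)                        # sqrt(u**2 + v**2)
wd = np.degrees(np.arctan2(u, v)) % 360.0  # angle between the wind vector and north

print(ws, wd)
```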

The performance of your model is evaluated using the Mean Absolute Error (MAE) between your prediction \(\hat{y}_{i}\) and the ground truth \(y_{i}\) over the \(N\) samples of the test set: $$ \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| y_{i} - \hat{y}_{i} \right| $$ The lower the MAE, the better your model is on the test set.

1. (2) Renaud de SOUZA 37 contributions 08/08/22 00:15 Score 0,122042
2. (36) Perseverance HOUESSOU 6 contributions 10/08/22 23:34 Score 0,122289
3. (1) Kévin KPAKPO 12 contributions 07/08/22 04:28 Score 0,124355
4. (6) Ousséni BIO KOUMAZAN 48 contributions 10/08/22 12:09 Score 0,124742
5. (4) Yanel 19 contributions 09/08/22 16:56 Score 0,125636
6. (3) Souléck Maoudé 40 contributions 08/08/22 10:39 Score 0,125772
7. (29) Josaphat Elonm AHOUANYE 8 contributions 09/08/22 22:44 Score 0,126918
8. (5) Michel 1 contribution 26/07/22 11:01 Score 0,130092
9. (10) Vivien OGOUN 35 contributions 11/08/22 00:04 Score 0,132747
10. (7) Joris GBENOU 14 contributions 04/08/22 17:53 Score 0,133002
11. Emery Patrice 3 contributions 08/08/22 17:04 Score 0,136476
12. (8) ADANLIENCLOUNON Précieux 41 contributions 09/08/22 16:18 Score 0,136851
13. (52) Fitahiana RAZAFIMAHENINA 13 contributions 09/08/22 22:22 Score 0,137303
14. Kéhat Tokannou 5 contributions 10/08/22 23:49 Score 0,137844
15. (9) AHOUANGAN Mickaël 26 contributions 04/08/22 10:16 Score 0,137965
16. (25) Ogbinto Samir Tafel BONI 9 contributions 10/08/22 21:48 Score 0,138147
17. (11) LONTCHEDJI Roméo 15 contributions 03/08/22 11:37 Score 0,138168
18. (47) Monsoï Consolas Névinas HODONOU 13 contributions 10/08/22 23:54 Score 0,138202
19. Freeda HOUETONSI 1 contribution 08/08/22 16:03 Score 0,138774
20. (12) Doriane ASSOGBA 6 contributions 06/08/22 20:18 Score 0,138805
21. CHITOU Abibou 9 contributions 10/08/22 17:06 Score 0,138810
22. (13) Adouke 2 contributions 29/07/22 19:05 Score 0,138886
23. (40) Somar 35 contributions 09/08/22 22:19 Score 0,138921
24. (14) MOÏSE NAYAGA 22 contributions 03/08/22 00:22 Score 0,138947
25. (15) Abakar Mallah 10 contributions 02/08/22 18:52 Score 0,138950
26. Wenda Zava-Niaina Rasoloharinjatovo 3 contributions 10/08/22 17:10 Score 0,138985
27. (16) Majorant Kougblenou 4 contributions 05/08/22 00:09 Score 0,138988
28. (17) Farid BONI 3 contributions 04/08/22 20:13 Score 0,139004
29. (18) Parfait Detchenou 8 contributions 26/07/22 14:17 Score 0,139005
30. (19) Charmant BALOGOUN 14 contributions 04/08/22 16:52 Score 0,139014
31. (20) Junior SOSSOU 1 contribution 31/07/22 17:35 Score 0,139029
32. (21) AZINNONGBE Hector Donatus 1 contribution 02/08/22 15:28 Score 0,139029
33. (22) TOURE 5 contributions 04/08/22 18:12 Score 0,139029
34. (23) Sergio Bossou 2 contributions 26/07/22 17:09 Score 0,139039
35. (24) Gabriel Medenou 8 contributions 04/08/22 23:47 Score 0,139045
36. (50) Audrey Kouessi FANGNON 11 contributions 08/08/22 21:52 Score 0,139137
37. (35) Fidèle DEGNI 6 contributions 10/08/22 01:32 Score 0,139148
38. (30) Crédo HOUNDOFI 6 contributions 09/08/22 19:57 Score 0,139225
39. (26) Emilio abata 7 contributions 02/08/22 23:48 Score 0,139243
40. (27) LAWANI Aduni 11 contributions 07/08/22 20:50 Score 0,139600
41. (49) Abdillahi Mohamed 4 contributions 08/08/22 23:38 Score 0,139602
42. (28) Sienka Dounia 6 contributions 09/08/22 23:35 Score 0,139602
43. (31) Fréjuste ABINONKPA 7 contributions 05/08/22 00:30 Score 0,140114
44. #D4RKS1D3 8 contributions 10/08/22 23:55 Score 0,140309
45. (32) Ayeda François DANIEL 5 contributions 07/08/22 19:37 Score 0,140351
46. (33) Arnaud Hounsou 3 contributions 05/08/22 00:15 Score 0,140369
47. (44) TRIPLE X 7 contributions 09/08/22 00:27 Score 0,140383
48. (34) Vincent Whannou de Dravo 1 contribution 05/08/22 21:42 Score 0,140509
49. (37) Fulgence Payot AKPONI 6 contributions 03/08/22 17:51 Score 0,142040
50. (38) Océane Hountondji 2 contributions 03/08/22 03:22 Score 0,142456
51. (39) LODJO oluwa-femi 12 contributions 06/08/22 01:07 Score 0,144431
52. (54) Wadoud adam 3 contributions 10/08/22 23:55 Score 0,145679
53. (41) Sidikoth ABIBOU 4 contributions 06/08/22 19:42 Score 0,147736
54. (42) Bénis de Dieu 1 contribution 05/08/22 20:22 Score 0,148091
55. (43) HOUNDADJO Jason Eraste 2 contributions 01/08/22 17:21 Score 0,148157
56. BOSSOU Landry 3 contributions 10/08/22 18:08 Score 0,148214
57. (45) KG KG 2 contributions 30/07/22 14:40 Score 0,148452
58. (46) Ariel ADANHO 1 contribution 02/08/22 21:36 Score 0,148563
59. Carmelle HOUNTONDJI 4 contributions 09/08/22 18:49 Score 0,148692
60. (48) Romuald kouelo 2 contributions 05/08/22 19:51 Score 0,149160
61. AMIDOU ATCHAMOU 1 contribution 09/08/22 17:32 Score 0,149658
62. KOUDOGBO König 1 contribution 10/08/22 21:01 Score 0,149882
63. (51) Françoise SETANGNI 2 contributions 04/08/22 17:37 Score 0,150217
64. (53) Adonis NOBIME 12 contributions 01/08/22 10:29 Score 0,170700