{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# An Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All right, we hope that was a sufficiently grandiose introduction. Now it's time to get our hands dirty and work through an example. We'll start very simple and delve deeper throughout the book. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook loads some actual measurements of the Sun over time, cleans the data, and then uses machine learning techniques to fit those data. \n", "\n", "The observations come from the Extreme Ultraviolet Variability Experiment (EVE) on the Solar Dynamics Observatory (SDO), which measures all the light coming from the Sun between 5 nanometers (nm) and 105 nm. We'll work only with the measurements taken at 17.1 nm, light emitted by the iron (Fe) IX ion in the corona, which exists only at a temperature of about 600,000 K, a fairly moderate temperature for the solar corona. \n", "\n", "For the machine learning, we'll use Support Vector Regression (SVR) and validation curves. Support Vector Machines (SVMs) are typically used for **classification**, an important category of machine learning that focuses on identifying and labeling groups in the data, but they can be extended to regression. There's some discussion of the function we'll be using [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) and [here](http://scikit-learn.org/stable/modules/svm.html#svm-regression). Validation curves are a way of quantifying the question: _which fit is the best fit?_ Data scientists are probably used to seeing things like reduced $\\chi^2$. The purpose is the same, but these tools are bundled together in a Python package we'll be using extensively, called [scikit-learn](http://scikit-learn.org/stable/). 
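
To make the idea of a validation curve concrete before touching the real data, here is a minimal sketch on a synthetic noisy sine wave. The data, the choice of `gamma` as the parameter to scan, and the scan range are all assumptions made for illustration; they are not the settings used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.svm import SVR

# Synthetic stand-in for a light curve: a noisy sine wave (not the EVE data)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200).reshape(-1, 1)  # scikit-learn expects 2-D features
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

# Candidate values of the RBF kernel width to compare (illustrative range)
gammas = np.logspace(-3, 2, 6)

# Score each gamma on repeated random train/test splits
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
train_scores, test_scores = validation_curve(
    SVR(kernel='rbf'), x, y,
    param_name='gamma', param_range=gammas,
    cv=cv, scoring='explained_variance',
)

# Average over the splits; the preferred fit maximizes the held-out score
mean_test = test_scores.mean(axis=1)
best_gamma = gammas[np.argmax(mean_test)]
print('best gamma:', best_gamma)
```

Each row of `train_scores` and `test_scores` corresponds to one value of `gamma` and each column to one random split. Comparing the two is what separates underfitting (both scores low) from overfitting (training score high, held-out score low).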
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we'll import all the stuff we're going to need, just to get that out of the way." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Standard modules\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import seaborn as sns\n", "plt.style.use('seaborn')\n", "from scipy.io.idl import readsav\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import validation_curve, ShuffleSplit\n", "from sklearn.metrics import explained_variance_score, make_scorer\n", "from sklearn.svm import SVR\n", "from pandas.plotting import register_matplotlib_converters\n", "register_matplotlib_converters()\n", "\n", "# Custom modules\n", "from jpm_time_conversions import metatimes_to_seconds_since_start, datetimeindex_to_human" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and clean data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we will load up the data. You can download that dataset from [here](https://www.dropbox.com/s/hmrb6eajwv6g6ec/Example%20Dimming%20Light%20Curve.sav?dl=0) or from the HelioML folder containing this notebook and then just update the path below as necessary to point to it. We're using [pandas](https://pandas.pydata.org/) DataFrames largely because they are highly compatible with scikit-learn, as we'll see later. Finally, we'll use the ```head()``` function to take a quick look at the data. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | irradiance | uncertainty |\n",
"|---|---|---|\n",
"| 2012-04-16 17:43:20 | 0.246831 | 0.052733 |\n",
"| 2012-04-16 17:44:19 | 0.399922 | 0.085439 |\n",
"| 2012-04-16 17:45:18 | 0.275836 | 0.058930 |\n",
"| 2012-04-16 17:46:17 | 0.319487 | 0.068255 |\n",
"| 2012-04-16 17:47:16 | 0.920058 | 0.196561 |\n",
"