Analyzing the behavior of a single spectral line using unsupervised learning
by Brandon Panos$^{1,2}$
1 University of Applied Sciences and Arts Northwestern Switzerland, Bahnhofstrasse 6, 5210 Windisch, Switzerland
2 University of Geneva, CUISIP, 1205 Geneva, Switzerland
In this chapter we will explore how the spectral line associated with once ionized Magnesium (Mg II) behaves during a solar flare. To do this, we will make use of one of the most celebrated classical machine learning algorithms known as the kmeans algorithm, to conjointly analyze slitjaw images (SJI) and spectra collected by the Interface Region Imaging Spectrograph (IRIS).
This chapter draws heavily on the functionality provided by IRISreader, a library constructed for the purposes of supporting machine learning exploration on data collected by IRIS. The contents presented in this chapter are hosted on Github and are reflections of the publication Panos et al, 2018, ApJ, 806, 14 .
Figure 1 — Example of kmeans applied to several different flares. The emergent spectra are color coded in accordance to which group (lower left panel) they are most similar to.
# We will make use of he following libraries for this study:
import irisreader as ir
ir.config.verbosity_level=0
from irisreader.data.mg2k_centroids import get_mg2k_centroid_table, normalize
from irisreader import observation
from sklearn.datasets.samples_generator import make_blobs
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import MiniBatchKMeans
from matplotlib.ticker import MaxNLocator
import matplotlib.gridspec as gridspec
from IPython.display import Image
import matplotlib.pyplot as plt
from scipy.spatial import distance_matrix
import scipy.io
import scipy
import numpy as np
np.random.seed(0)
import warnings
warnings.filterwarnings("ignore")
from utils import *
1. Introduction
In this section, we discuss some basic physics and instrumentation surrounding the main data types generated by the IRIS satellite. We focus our study on a particularly informative spectral line, the resonant kline of once ionized magnesium, and conclude with the posing of two simple but vital questions that we wish to address with the help of machine learning.
What does our instrument see?
The majority of data for this chapter is from NASA’s Interface Region Imaging Spectrograph (IRIS) small explorer spacecraft \cite{IRIS}, which is equipped with both a spectrograph and a slitjaw imager (SJI). The spectrograph monitors several important ions emitting in the ultraviolet, while the imager provides simultaneous high resolution contextual images of the solar atmosphere in three different passbands spanning the NUV and FUV wavelengths. IRIS includes a variety of observational modes to serve the particular research requirements, however, this makes big data analysis a challenging endeavor, since the data is highly heterogeneous. An example of the two data types, images in different passbands and spectra, can be seen in the proceeding two plots:
# define the path to the IRIS data
# /data2/iris/2014/03/29/20140329_140938_3860258481
path = '~/HelioChapter/IRISData/'
# load a flare observation using IRISreader
obs = observation( path + "20140329_140938_3860258481", keep_null=True )
# see what SJI channels the observation was recored in
obs.sji.get_lines()
field  wavelength  description  

0  FUV2  1400.0  Si IV 1400 
1  NUV  2796.0  Mg II h/k 2796 
2  NUV  2832.0  Mg II wing 2832 
# Plot SJIs of the solar atmosphere in the three different passbands
plt.figure( figsize=(25,10) )
plt.rcParams.update({'font.size': 15})
gs = gridspec.GridSpec(1, 3)
ax1 = plt.subplot(gs[0,0:1])
ax2 = plt.subplot(gs[0,1:2])
ax3 = plt.subplot(gs[0:,2:3])
ax1.imshow( obs.sji("Mg II wing").get_image_step(49).clip(min=0)**0.3, origin="lower", cmap="Greys", vmax=10 )
ax1.title.set_text('Mg II wings: 2832 $\AA$')
ax2.imshow( obs.sji("Mg II h/k").get_image_step(496).clip(min=0)**0.3, origin="lower", cmap="Greys", vmax=10 )
ax2.title.set_text('Mg II h&k, 2794 $\AA$')
ax3.imshow( obs.sji("Si").get_image_step(496).clip(min=0)**0.3, origin="lower", cmap="Greys", vmax=10 )
ax3.title.set_text('Si IV: 1400 $\AA$')
plt.tight_layout()
Figure 2  SJI’s of the same observation in three different passbands.
We can observe different heights in the solar atmosphere by restricting the band of wavelengths our instrument collects. The abundance and opacity of each atomic transition maps out a noisy “surface of last scattering”, where a large proportion of photons can escape the solar atmosphere to form an image on our instruments. For instance, at optical depth unity, $e^{1}$ photons slip through the atmosphere without incurring any scattering events. Additionally, the steep and rich temperature structure of the solar atmosphere allows atoms to access transitions involving lower principal quantum number due to increased ionization levels at higher altitudes. The complex coupling between hight and wavelength is illustrated in the SJI’s, with the $\sim$ 2832 $Å$ band scraping the upper photosphere, $\sim$ 2794 $Å$ band providing an excellent view of the chromosphere and the $\sim$ 1400 $Å$ band being sensitive to plasma of 65000 K, allowing views of the upper chromosphere and glimpses of the lower transition region.
The white strip running down the center of each SJI marks the location of the spectrograph, which allows a small fraction of filtered light to pass onto a grating, where it is split into its finer constituent parts, giving rise to the second data type… spectra.
# Plot Mg II h&k spectra
obs.raster("Mg II k").crop()
plt.figure( figsize=(25,10) )
plt.rcParams.update({'font.size': 15})
gs = gridspec.GridSpec(1, 2)
ax1 = plt.subplot(gs[0,0:1])
obs.raster("Mg II k").plot(0, cmap='Greys')
ax2 = plt.subplot(gs[0,1:2])
obs.raster("Mg II k").plot( 0, y=100)
ax2.margins(x=0)
Figure 3  Mg II spectra collected from the spectrograph. The right panel is a single slice at a fixed hight along the slit.
Spectral lines are the bread and butter of astronomy, and heliophysics is no exception. The shape of the spectra encodes a vast amount of physical information which can be leveraged by researchers to better understand remote environments such as the solar atmosphere. The above spectra is collected along the slit and shows the two resonant h&k lines of Mg II, which are amongst the strongest lines in the UV solar spectrum. The panel on the right displays a single profile sourced from a fixed yvalue along the slit. It is important to realize that we do not have spectral information over the entire SJI. The h&k lines are symmetric apart from a factor of two in their oscillation strengths. Although there is a great deal of information to be gained from their relative positions and intensities \cite{Paper2}, as a first study we consider only the strongest of the two lines, the Mg II kline.
Why is the Mg II kline an excellent source of information?
The kline has a formation hight that extends over the entire chromosphere as well as extensively damped wings which follow the temperature structure of the upper photosphere. The 2v and 2r peaks are formed in the midchromosphere, with line core photons at k3 originating from an altitude just below the transition region. The large opacities of the resonant line implies a small geometrical thermalization length, and hence high sensitivity to temperature, density and velocity gradients \cite{sensitivity}. Because of these attributes, studying this particular resonant line seems like a fruitful endeavor.
Forward modeling has provided us with great diagnostic tools for the quiet sun \cite{Paper1}, however, these diagnostics cannot be blindly extrapolated to the vastly different flaring atmosphere. For instance, the clear central reversal at k3 due to the uncoupling of the Plank and source function in the tenuous upper chromosphere is often seen in emission for flares (e.g. \cite{Lucia}).
Figure 4 — An example of the common nomenclature associated with the features of the Mg II resonant h&k lines. We analyze the kline on the left.
The questions that we would like to answer in this chapter are the following:
 Are the Mg II kline profiles generated during a solar flare different from those of the quiet sun?
 And if so, do different flares generate the same types of profiles?
2. kmeans
In this section, we discuss the mathematics behind the kmeans algorithm. We then experiment with kmeans on two very different artificial data sets to try and understand how to select the only free parameter in the algorithm, namely, the number of groups $k$. Finally, we apply the same methods to real life spectral data from an IRIS observation, and infer some properties based on what we have learned from the artificial data sets.
The kmeans algorithm
The kmeans algorithm \cite{macqueen1967} is a form of vector quantization, where a data set of points/vectors is divided into a specified number of groups, where each group is represented by a prototype or centroid which is the group average. kmeans in effect is a sophisticated multidimensional binning technique, where the space of observations is represented by a smaller set of prototypes. In other words, kmeans gives us a tractable observation space by providing us with a coarse representation of the original unwieldily space.
The algorithm works by initiating $k$ centroids into the data space. Each point is then assigned to the centroid they appear nearest to under the $L2$ norm. The kmeans objective is to 1) find the assignments and 2) find the positions of the centroids that minimize the total within cluster distance or “inertia” where $\mu_j$ is the $j\text{‘th}$ group centroid, $x_i$ the $i\text{‘th}$ observations and $\delta_{c_i,j}$ the kronecker delta defined as
The kmeans algorithm minimizes the objective function $\mathcal{L}$ by initiated the centroids in the locations of $k$ randomly selected data points, and then iterating over a two step procedure known as coordinate decent:
 The data is partitioned into groups based on which centroid they appear nearest to
 The centroids are updated as the mean position of each group
This two step iteration terminates when either the centroids stop updating their location, or the distance between the old and new centroid positions are all below some specified threshold. The clustering depends on the initial random placement of centroids, hence it is customary to repeat the process a number of times ad select the clustering with the lowest total inter cluster distances.
kmeans has a time complexity which is effectively linear \cite{kmeans_time}, making it big data scalable, however, it requires the manual selection of the number of groups $k$. A number of techniques have been proposed to automate the choice of $k$. We will explore one of the popular methods known as the elbow technique. We provide a fast minibatch version of the kmeans algorithm below, which will be used in this chapter. For the details of this version, we refer the reader to the source code at scikitlearn.
# kmeans algorithm
def mini_batch_k_means(X, n_clusters=10, batch_size=10, n_init=10, verbose=0):
'''
Retun centroids and labels of a data set using the kmeans algorithm in minibatch style
input  X: data in the form [m_examples, lambda]
 n_clusters: number of groups to find in the data
 batch_size: number of data points used for a single update step of the centroids (these add together in a running stream)
 n_init: number of convergences tested, the itteration with the lowest inertia is retained
 verbose: output statistics
output:  centroids: form [n_clusters, lambda], mean points of groups
 labels: list of assignments indicating which centroid each data point belongs to
 inertia: measure of performance, sum of all intercluster distances
'''
mbk = MiniBatchKMeans(init='kmeans++', n_clusters=n_clusters, batch_size=batch_size,
n_init=n_init, max_no_improvement=10, verbose=verbose)
mbk.fit(X)
centroids = mbk.cluster_centers_
labels = mbk.labels_
inertia = mbk.inertia_
return centroids, labels, inertia, n_clusters
Finding the optimal number of groups $k$
The central question for this algorithm is: how many groups should we partition our data into? For some data sets, there exists a natural number of groups which can be found by monitoring how the inertia behaves when successively increasing the group number. In the shoulder method, the optimal number of groups is given by the point where the inertia drastically decreases and then subsequently undergoes only a gradual improvement with each additional group. This behavior when plotted looks like a shoulder and is where the technique derives its name. Although such a technique is useful for some data sets, it fails for others. We will demonstrate this with two synthetic data sets and then apply the same technique on a real data set containing spectra and compare the results.
# generate two sample data sets, X1 and X2
batch_size = 45
centers = [[1, 1], [1, 1], [1, 1], [1, 1], [0, 0]]
X1, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.2)
X2, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=10)
# plot both data sets
fig = plt.figure( figsize=(15,5) )
ax1=plt.subplot(121)
ax1.plot(X1[:, 0], X1[:, 1], 'w', markerfacecolor='k', marker='X')
plt.title('Data set X1 with an obvious clustering')
ax2=plt.subplot(122)
ax2.plot(X2[:, 0], X2[:, 1], 'w', markerfacecolor='k', marker='X')
plt.title('Data set X2 with no obvious structure')
plt.show()
Figure 5  Two very different data sets for exploring a potential method for discerning the number of groups $k$ to use in the kmeans algorithm.
# collect lists of both data sets inertia for successive increases in group number
X1_list_of_inertias = []
for n_groups in range(10):
centroids, labels, inertia, n_clusters = mini_batch_k_means( X1, n_clusters=n_groups+1 )
X1_list_of_inertias.append( inertia )
X2_list_of_inertias = []
for n_groups in range(10):
centroids, labels, inertia, n_clusters = mini_batch_k_means( X2, n_clusters=n_groups+1 )
X2_list_of_inertias.append( inertia )
# plot inertia vs. group number for each data set
fig = plt.figure( figsize=(15,5) )
ax1=plt.subplot(121)
ax1.plot( np.linspace(1,10,len(X1_list_of_inertias)), X1_list_of_inertias, color='k' )
ax1.xaxis.set_major_locator(plt.MaxNLocator(10))
ax1.axvline(x=5, linestyle='')
ax1.margins(x=0)
plt.ylabel('Inertia', size=15)
plt.xlabel('Number of groups', size=15)
plt.title('Inertia for data set X1 with varying group number')
ax2=plt.subplot(122)
ax2.plot( np.linspace(1,10,len(X2_list_of_inertias)), X2_list_of_inertias, color='k' )
ax2.xaxis.set_major_locator(plt.MaxNLocator(10))
ax2.axvline(x=3, linestyle='')
ax2.margins(x=0)
plt.ylabel('Inertia', size=15)
plt.xlabel('Number of groups', size=15)
plt.title('Inertia for data set X2 with varying group number')
plt.tight_layout()
plt.show()
A clear shoulder appears for data set X1, suggesting a natural choice for the number of groups as 5. On the other hand, data set X2 has a less distinguishable shoulder, and hence no clear group number. We will now plot the clustering results from the kmeans algorithm with the plotting function below.
# plotting function to visualize the results of kmeans
def plot_kmeans_results( X, labels, n_clusters, ax=None, **kwargs ):
ax = ax or plt.gca()
colors = ['#FF9C34', 'black', 'gold', 'aquamarine', '#4EACC5',
'#FF9C34', '#4E9A06', 'pink', 'blue', 'grey', 'red']
for k, clr in zip(range(n_clusters), colors):
my_members = labels == k
ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=clr, marker='X')
ax.set_title('Kmeans results')
plt.grid(True)
return None
# use kmeans with k = 5 for data set X1 and X2
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,5))
centroids, labels, inertia, n_clusters = mini_batch_k_means( X1, n_clusters=5 )
plot_kmeans_results(X1, labels, n_clusters, ax1)
centroids, labels, inertia, n_clusters = mini_batch_k_means( X2, n_clusters=5 )
plot_kmeans_results(X2, labels, n_clusters, ax2)
Figure 6  kmeans clustering results for data set X1 (left) and X2 (right).
Applying kmeans to an IRIS observation
Although there are two strong resonant Mg II lines in the spectral window, for simplicity we have chosen to focus on just the kline. To this end, we have to specify a maximum and minimum wavelength in order to partition and extract only the strongest of the two lines. Additionally, we will provide the size of the wavelength grid to interpolate onto.
lambda_min = 2793.85
lambda_max = 2799.32
n_bins = 216 # number of wavelength points to interpolate onto
line_core = 2796.34 # location of the Mg II kline core
nprofxax = np.linspace( lambda_min, lambda_max, n_bins ) # new wavelength xaxis for plotting results
For this real world application, we have to clean the data in several ways before we can apply the kmeans algorithm and the shoulder technique.

IRIS uses different spectral resolutions based on the observation description and research requirements. This means that spectra form different observations have to be interpolated onto the same wavelength grid. It is important for machine leaning applications to have the same size feature space for each example. This procedure is automatically handled by the functionality provided in IRISreader.

The spectra has to be reshaped into the same form as data set X1 and X2.

Missing data in IRIS appears with a value of 200 and should be removed from the data set.

It is advisable to normalize each spectra by its maximum value, this has the consequence of reducing the parameter space and speeding up algorithms predicated on gradient descent, however, disregarding the intensity means that all applications after this step are only concerned with the shapes of the spectra.
To this end, we have provided a number of cleaning functions below.
# functions to clean and place our spectral profiles into a 216 dimensional vector space
def profile_rep( data ):
'''
put the data into the form [ m_examples, lambda ]
'''
return data.reshape( data.shape[0] * data.shape[1] * data.shape[2], data.shape[3] )
def clean(X, mode='del'):
'''
clean profiles which meet the following criterion:
 Negative values > condition = profile with any value <= 100
 Overexposure > condition = more than 3 points == max
input  X: matrix (example, feature)
 modes:
1) 'ones' > replaces bad profiles with a vector of ones
2) 'nans' > replaces bad profiles with a vector of nan's
3) 'del' > deletes bad profiles (this changes dimentional structure, but usefull for unsupervised)
'''
# Clean profiles with negative
small = np.min(X, axis = 1)
# Clean overexposed profiles
maxx = np.max(X, axis=1).reshape(len(X),1)
diff = (Xmaxx)
w = np.where(diff == 0)
dumb = np.zeros((X.shape[0], X.shape[1]))
dumb[w[0], w[1]] = 1
s = np.sum(dumb, axis = 1)
if mode == 'ones':
X[np.logical_or(small <= 100, s >=3)] = np.full((1, X.shape[1]), fill_value = 1)
if mode == 'nans':
X[np.logical_or(small <= 100, s >=3)] = np.full((1, X.shape[1]), fill_value = np.nan)
if mode == 'del':
X = np.delete(X, np.concatenate((np.where(small<=100)[0], np.where(s>=3)[0])), axis=0)
return X
def normalize( X ):
'''
normalize each profile by its maximum
'''
maxx = np.max( X, axis=1 ).reshape(1, len( X ) ).T
return X/maxx
# load observation with IRISreader
obs = observation( path + "20140910_112825_3860259453", keep_null=True)
data, raster, sji = extract_flare( obs, lambda_min, lambda_max, n_bins, verbosity_level=0 )
# reshape and clean the data with the above functions
X = profile_rep( data )
X = clean( X, mode='del' )
X = normalize( X )
# collect a list of the data sets inertia for successive increases in group number
# If the calculation is taking to long you can reduce the batch_size, however, this will result in an erratic plot.
# For the purposes of producing an inertia vs. group number plot, the minibatch kmeans has been set to run
# as normal kmeans by selecting batch_size = len(X). The calculation should take around 4 min
list_of_inertias = []
for n_groups in range(10):
centroids, labels, inertia, n_clusters = mini_batch_k_means(X, n_clusters=n_groups+1,
batch_size=len(X),
n_init=10,
verbose=0)
list_of_inertias.append( inertia )
# plot inertia vs. group number for our spectral data set
fig, ax = plt.subplots( figsize=(12,7) )
plt.plot( np.linspace(1,10,len(list_of_inertias)), list_of_inertias, color='k' )
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
plt.axvline(x=2, linestyle='')
plt.margins(x=0)
plt.ylabel('Inertia', size=15)
plt.xlabel('Number of groups', size=15)
plt.tight_layout()
plt.show()
Figure 7  Inertia v.s. number of groups for a real IRIS spectral data set. There is a clear sholder in the inertia at number of groups = 2, however, this simply devides quiet sun and flaring spectra as seen in Figure 8.
# seperate the data into two groups X_qs and X_fl
centroids, labels, inertia, n_clusters = mini_batch_k_means( X, n_clusters=2, n_init=20 )
# plot the centroid of each group
X_qs = centroids[0]
X_fl = centroids[1]
fig, ax = plt.subplots(figsize=(12,7))
ax.plot(nprofxax, X_qs, label='quiet sun')
ax.plot(nprofxax, X_fl, label='flare')
ax.axvline(x=line_core,color='black',linewidth = 1)
plt.xlabel('Wavelength [$\AA$]')
plt.ylabel('Normalized intensity')
ax.legend()
ax.margins(x=0)
plt.tight_layout()
Figure 8  Centroids using the suggested parameter $k=2$ from the shoulder technique. This superfluous separates quiet sun and flaring type spectra.
Apart from this separation, setting the number of groups equal to 2 does not offer us enough “information resolution” to study the dynamics of the solar atmosphere. Additionally, the inertia vs. group number plot after $n=2$ improves excessively slowly with the addition of more groups, implying that our real life spectral data resembles a structureless data set similar to that of X2. It is often the case, and spectral analysis with IRIS is no exception, that the selection of the number of groups to use in kmeans can not be automated, and should be left to the discretion of a professional in the field.
3. Application of kmeans to IRIS data
It appears that we need to reach for an external source of data other that the raw spectra in order to decide how to fix $k$. In this section, we load three IRIS observations, one quiet sun, and two flaring observations, which we refer to as qs, f1 and f2 respectively. We apply the kmeans algorithm to the qs observation and then create what we call a “centroid mask”, indicating in a live SJI animation the group assignments of the emergent spectra. We then take the centroid mask generated for the qs observation and try to apply it to one of the flaring observations. The results indicate the incompatibility of the mask, meaning that the spectra from flaring atmospheres is markably different form those of the quiet sun. In the light of these observations, we create a new mask predicated on one of the flares, and see whether this mask is still compatible with the remaining flare observation. If it is, then we can start drawing conclusions about the possible similarity of chromospheric physics in flaring atmospheres.
Our observations:
We will load our three IRIS observations using the preprocess_obs and extract_flare functions of IRISreader provided in Utils.py. The functions return a plot of the GOES Xray flux over the green period where the IRIS observations were extracted. In the case of the qs observation, the total integrated soft Xray flux is below flaring level and stable, in comparison to the flaring observations which indicate regions of high Xray flux. Animated movies of each observation obtained by interweaving consecutive SJIs together accompany the GOES plots. In addition to this, a link to the Heliophysics Events Knowledgebase (HEK) is provided if the reader requires a more detailed description of the observations. Finally, a small insert of the data, raster and sji shapes are provided.
obs_qs = observation( path + "20140208_133217_3803257203", keep_null=True)
data_qs, raster_qs, sji_qs = preprocess_obs( obs_qs, "QS", lambda_min, lambda_max, n_bins, verbosity_level=2 )
index interval for raster position 0: 0294 (totalling 294 exposures, 24.9 minutes)