Galaxy classification: A machine learning analysis of GAMA catalogue data


University of Groningen

Galaxy classification

Nolte, Aleke; Wang, Lingyu; Bilicki, Maciej; Holwerda, Benne; Biehl, Michael

Published in: Neurocomputing

DOI: 10.1016/j.neucom.2018.12.076


Document Version: Publisher's PDF, also known as Version of record

Publication date: 2019


Citation for published version (APA):

Nolte, A., Wang, L., Bilicki, M., Holwerda, B., & Biehl, M. (2019). Galaxy classification: A machine learning analysis of GAMA catalogue data. Neurocomputing, 342(SI), 172-190. https://doi.org/10.1016/j.neucom.2018.12.076



Galaxy classification: A machine learning analysis of GAMA catalogue data

Aleke Nolte a,∗, Lingyu Wang b,c, Maciej Bilicki d,e, Benne Holwerda d,f, Michael Biehl a

a Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, P.O. Box 407, Groningen 9700 AK, the Netherlands

b Kapteyn Astronomical Institute, University of Groningen, Landleven 12, Groningen 9747 AD, the Netherlands

c SRON Netherlands Institute for Space Research, the Netherlands

d Leiden Observatory, Leiden University, Leiden P.O. Box 9513, 2300 RA, the Netherlands

e Center for Theoretical Physics, Polish Academy of Sciences, al. Lotników 32/46, Warsaw 02-668, Poland

f Department of Physics and Astronomy, University of Louisville, 102 Natural Science Building, Louisville, KY 40292, USA

Article info

Article history: Received 20 July 2018; Revised 1 November 2018; Accepted 16 December 2018; Available online 3 February 2019

Keywords: Learning Vector Quantization; Relevance learning; Galaxy classification; Random Forests

Abstract

We present a machine learning analysis of five labelled galaxy catalogues from the Galaxy And Mass Assembly (GAMA): The SersicCatVIKING and SersicCatUKIDSS catalogues containing morphological features, the GaussFitSimple catalogue containing spectroscopic features, the MagPhys catalogue including physical parameters for galaxies, and the Lambdar catalogue, which contains photometric measurements. Extending work previously presented at the ESANN 2018 conference, in an analysis based on Generalized Relevance Matrix Learning Vector Quantization and Random Forests, we find that neither the data from the individual catalogues nor a combined dataset based on all 5 catalogues fully supports the visual-inspection-based galaxy classification scheme employed to categorise the galaxies. In particular, only one class, the Little Blue Spheroids, is consistently separable from the other classes. To aid further insight into the nature of the employed visual-based classification scheme with respect to physical and morphological features, we present the galaxy parameters that are discriminative for the achieved class distinctions.

1. Introduction

Telescope images of galaxies reveal a multitude of appearances, ranging from smooth elliptical galaxies, through disk-like galaxies with spiral arms, to more irregular shapes. The study of morphological galaxy classification plays an important role in astronomy: the frequency and spatial distribution of galaxy types provide valuable information for the understanding of galaxy formation and evolution [1,2].

The assignment of morphological classes to observed galaxies is a task which is commonly handled by astronomers. As manual labelling of galaxies is time-consuming and expert-devised classification schemes may be subject to cognitive biases, machine learning techniques have great potential to advance astronomy by (1) investigating automatic classification strategies, and (2) evaluating to what extent existing classification schemes are supported by the observational data.

In this work, we extend a previous analysis [3] to make a contribution along both lines by analysing several galaxy catalogues which have been annotated using a recent classification scheme proposed by Kelvin et al. [4]. In our previous study, we assessed whether this scheme is consistent with a galaxy catalogue containing 42 astronomical parameters from the Galaxy And Mass Assembly (GAMA, [5]) by performing both an unsupervised and a supervised analysis with prototype-based methods. We assessed whether class structure can be recovered by a clustering of the data generated by the unsupervised Self-Organizing Map (SOM) [6], and investigated if the morphological classification can be reproduced by Generalized Relevance Matrix Learning Vector Quantization (GMLVQ) [7], a powerful supervised prototype-based method [8] chosen for its capability to not only provide classification boundaries and class-representative prototypes, but also feature relevances. Finding consistently negative results for the supervised and unsupervised method, namely an intermediate classification accuracy of GMLVQ of around 73% and no clear-cut agreements between galaxy classes and SOM-clustering results, we concluded the classification scheme to be not fully supported by the considered galaxy catalogue. As discussed previously [3], the hypothesised misalignment between galaxy data and classification scheme could be explained by lack of discriminative power of the employed classifiers or clustering methods, by mis-labellings of certain galaxies (a possibility already discussed in [9]), or by the absence of essential parameters in the dataset. In this work, we address two of the mentioned aspects: We employ an additional established and flexible classifier, Random Forests [10], to collect evidence that the previously found moderate classification performance is not due to shortcomings of GMLVQ. Furthermore, we address the potential incompleteness of the previously analysed dataset by performing another set of supervised analyses on several additional galaxy catalogues from the GAMA survey [11], which contain a multitude of additional photometric, spectroscopic and morphological measurements.

∗ Corresponding author. E-mail address: a.f.nolte@rug.nl (A. Nolte).

Table 1

Overview of galaxy catalogues analysed in this work. Shown are also the number of samples for which complete information, i.e. information from each of the catalogues, is available, and the number of samples in the final dataset considered in the remainder of this work.

| Catalogue | Shorthand | Number of samples after preprocessing |
|---|---|---|
| GaussFitSimple | GFS | 7430 galaxies with 59 emission line features |
| Lambdar | Lambdar | 7365 galaxies with 28 flux measurements and uncertainties for different bands |
| MagPhys | MagPhys | 7541 galaxies with 171 features |
| SersicCatVIKING | Viking | 5476 galaxies with 66 Sérsic features |
| SersicCatUKIDSS | Ukidss | 3008 samples with 53 Sérsic features |
| Complete information from all catalogues | | 2117 galaxies |
| Final sample (cf. Section 2.6) | | 1295 galaxies |

Despite the commonly quoted abundance of data in astronomy, well-accepted benchmark datasets are not readily available in the field of galaxy classification, and only a few works analysing GAMA catalogues with machine learning methods exist. In an analysis by Sreejith et al. [9], 10 features from GAMA catalogues are hand-selected and analysed using Support Vector Machines, Decision Trees, Random Forests and a shallow Neural Network architecture. With respect to Kelvin et al.'s classification scheme a maximum classification accuracy of 76.2% is reported. Turner et al. [12] perform an unsupervised analysis of five hand-selected features from GAMA catalogues using k-means clustering. While not the main aim of Turner et al.'s analysis, a comparison of the determined clusters with class information from Kelvin et al. shows galaxies that are assigned the same class by Kelvin et al. spread over several clusters (Figs. 11, 13, 15 and 17 in [12]).

In agreement with our previous results and the analyses from the above mentioned literature, we find the employed classification scheme to not be fully supported even when considering the additional catalogues and an alternative classifier. Interestingly, analogous to our previous work [3], the Little Blue Spheroids, a galaxy class newly introduced in [4], remains most clearly pronounced, also for the set of catalogues analysed in this work. We present the parameters that are the most relevant for the achieved class distinctions.

The paper is organised as follows: In Section 2 the analysed galaxy catalogues and their preprocessing are described. Section 3 outlines the employed classification methods, GMLVQ and Random Forests. Section 4 describes experimental setups and results. The work closes with a discussion in Section 5.

This paper constitutes an extension of our contribution to the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) 2018 [3]. Parts of the text have been taken over literally without explicit notice. This concerns, among others, parts of the introduction and the description of GMLVQ in Section 3.

2. Data

In this work we analyse data from five galaxy catalogues (Table 1) containing features which have been derived from spectroscopic and photometric observations, i.e. measurements of flux intensities in different wavelength bands from the Galaxy And Mass Assembly (GAMA) survey [11] for a sample of 1295 galaxies. As the catalogues contain information for different sets of galaxies, our data set consists of the set of galaxies for which a full set of features is available after balancing the relevant classes (cf. Section 2.6).

To determine this set, each catalogue is first cross-referenced with the galaxy sample analysed in our ESANN contribution [3,9], which contains class labels for 7941 astronomical objects. The resulting subsample is further preprocessed by selecting measurements based on the specifics of each catalogue. Subsequently, missing values are treated by first removing feature dimensions with a considerable amount of missing values (more than 500 missing values per feature dimension) and then discarding samples which contain missing values in any of the remaining feature dimensions.
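The two-step missing value treatment can be sketched as follows. This is a minimal illustration and not the code used for the analysis: it assumes that missing entries are encoded as NaN in a NumPy array, and the function name `treat_missing` is hypothetical. The toy call uses a threshold of 1 instead of 500 so that the effect is visible on four rows.

```python
import numpy as np

def treat_missing(X, max_missing=500):
    """Drop sparse feature columns first, then incomplete samples.

    X is a 2-D float array in which missing values are encoded as NaN
    (an assumption of this sketch). Returns the cleaned array and the
    boolean masks of kept columns and kept rows.
    """
    missing = np.isnan(X)
    # Step 1: keep feature dimensions with at most `max_missing` gaps.
    keep_cols = missing.sum(axis=0) <= max_missing
    # Step 2: among the kept columns, keep samples without any gap.
    keep_rows = ~missing[:, keep_cols].any(axis=1)
    return X[keep_rows][:, keep_cols], keep_cols, keep_rows

# Toy data: column 1 has two gaps, column 0 has one, column 2 has none.
X = np.array([
    [1.0, np.nan, 3.0],
    [4.0, np.nan, 6.0],
    [np.nan, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])
X_clean, cols, rows = treat_missing(X, max_missing=1)
```

The order of the two steps matters: removing sparse columns first avoids discarding samples only because of a feature that is dropped anyway.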

Details of each catalogue as well as specific processing steps are delineated in the following paragraphs.

2.1. GaussFitSimple

The GaussFitSimple catalogue (GFS) [13] contains parameters of Gaussian fits to 12 important emission lines found in galaxy spectra, namely the emission lines of oxygen ([O I] emission lines at 6300 Å and 6364 Å, in the following denoted as OIB and OIR, [O II] lines at 3726 Å and 3729 Å, denoted as OIIB and OIIR, [O III] lines at 4959 Å and 5007 Å, denoted as OIIIR and OIIIB), nitrogen ([N II] lines at 6548 Å and 6583 Å, NIIR and NIIB), sulphur ([S II] lines at 6716 Å and 6731 Å, SIIR and SIIB), and hydrogen (Hα and Hβ lines at 6563 Å and 4861 Å, respectively). Further, the catalogue contains slope and intercept of the continuum, that is, the background radiation in-between emission lines. In addition to these parameters the catalogue also contains meta-information concerning model fits and corresponding errors.

From the GaussFitSimple catalogue we select amplitudes (AMP_∗) and sigma (SIG_∗) of the Gaussian fit for each emission line, as well as calculated fluxes (∗_FLUX) and equivalent widths (∗_EW). Here and in the following, the asterisk ∗ is a placeholder for the name of the corresponding emission line. We further include information about the continuum (CONT, GRAD) and the strength of the D4000 break, resulting in 59 selected features. We discard all samples for which a failure of the fitting procedure has been indicated (FITFAIL_∗), and remove samples containing missing values in any of the feature dimensions. The resulting sub-catalogue then contains 7430 galaxies with 59 emission line features.
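A pattern-based selection of this kind might look as follows. The sketch is illustrative only: the concrete column names (`AMP_HA`, `HA_FLUX_ERR`, and so on) and the helper names are hypothetical, a nonzero `FITFAIL_∗` value is assumed to flag a failed fit, and the D4000 column is left out because its name is not given in the text.

```python
from fnmatch import fnmatchcase

# Name patterns following the shorthand used in the text.
FEATURE_PATTERNS = ["AMP_*", "SIG_*", "*_FLUX", "*_EW", "CONT", "GRAD"]

def select_features(columns):
    """Return the column names that match any of the feature patterns."""
    return [c for c in columns
            if any(fnmatchcase(c, p) for p in FEATURE_PATTERNS)]

def fit_succeeded(row):
    """True if no FITFAIL_* flag is set (assumed: nonzero means failure)."""
    return not any(v for k, v in row.items() if k.startswith("FITFAIL_"))

# Hypothetical rows; real catalogue column names may differ.
rows = [
    {"AMP_HA": 2.1, "SIG_HA": 1.3, "HA_FLUX": 5.0, "HA_EW": 12.0,
     "CONT": 0.4, "GRAD": 0.01, "HA_FLUX_ERR": 0.2, "FITFAIL_HA": 0},
    {"AMP_HA": 0.0, "SIG_HA": 0.0, "HA_FLUX": 0.0, "HA_EW": 0.0,
     "CONT": 0.3, "GRAD": 0.02, "HA_FLUX_ERR": 9.9, "FITFAIL_HA": 1},
]
features = select_features(rows[0].keys())
kept = [{k: r[k] for k in features} for r in rows if fit_succeeded(r)]
```

Matching whole names rather than substrings keeps error columns such as the hypothetical `HA_FLUX_ERR` out of the reduced feature set.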

We note that the classification performance on the full catalogue, which contains model fit information and errors/measurement uncertainties, is comparable to the results achieved with the reduced catalogue containing 59 features (cf. Section 4). As the selected parameters allow for a more direct interpretation in terms of emission line strengths and therefore facilitate interpretation from the astronomical perspective, we consider the reduced catalogue in the following.

2.2. Lambdar

The Lambdar catalogue [14] contains flux measurements and uncertainties for 21 bands, as measured by the LAMBDAR software [14]. When cross-referencing with the catalogue analysed in our preceding study, 400 galaxies are missing from the Lambdar catalogue. These galaxies are removed from the considered Lambdar subset and do not contribute to the ensuing missing value calculations. Columns still containing a considerable amount of missing values after this step (> 500) are excluded from the analysis. The removed columns contain parameters that include fluxes and errors in the far and near Ultraviolet (UV) (FUV_flux, FUV_fluxerr, NUV_flux, NUV_fluxerr), and fluxes and errors in the 100 μm to 500 μm bands (P100_flux, P100_gcfluxerr, P160_gcflux, P160_gcfluxerr, S250_gcflux, S250_gcfluxerr, S350_gcflux, S350_gcfluxerr, S500_gcflux, and S500_gcfluxerr). After removing these, 28 features remain in the catalogue, namely fluxes and errors for the u, g, r, i and z bands observed in the Sloan Digital Sky Survey (SDSS, [15]), the Z, Y, J, H and K bands from the VISTA Kilo-Degree Infrared Galaxy Survey (VIKING, [16]), and the W1, W2, W3 and W4 bands from the Wide-field Infrared Survey Explorer (WISE, [17]). After this step, samples that are missing measurements for any of the remaining features are removed, resulting in a final sub-catalogue of 7365 galaxies with 28 features.

2.3. MagPhys

The MagPhys catalogue [18] contains physical parameters comprising information about stellar populations as well as parameters describing the interstellar medium in the galaxies. Parameters include, among others, star formation rates, star formation time-scales, information about star formation bursts, as well as the masses of stars formed in the bursts, overall stellar ages and masses, metallicities, and information about dust in the interstellar medium and in stellar birth clouds; all this for each included galaxy. All MagPhys parameters have been derived from information provided in the Lambdar catalogue (Section 2.2) using the MAGPHYS program [18,19]. Due to missing values in the Lambdar catalogue, the MagPhys catalogue does not contain information for 400 of the galaxies analysed in our ESANN contribution [3]. Apart from these, there are no missing values, so that information from 177 MagPhys features is available for 7541 galaxies. However, after selecting the final sample (cf. Section 2.6) some parameters exhibit almost no variance over the considered samples: Parameters fb17_percentile2_5, fb18_percentile2_5, fb17_percentile16, fb17_percentile50, fb17_percentile84 and fb18_percentile16¹ are largely constant, with maximally 15 data points displaying deviations. We therefore remove these features, which results in a dimensionality of 171 for the final MagPhys sample.

Information on the MagPhys parameter shorthand notation used in the remainder can be found in [20].

2.4. Sérsic catalogues

Three different catalogues are available which contain parameters of single-Sérsic-component fits to the 2D surface brightness distribution of galaxies in different bands [21]. The single-Sérsic-component fits have been produced with the GALFIT program [22].

¹ Percentiles of the likelihood distribution of parameters describing the fraction of the effective stellar mass formed in bursts over the last 10^7 and 10^8 years.

The catalogues contain a parameter, GALPLAN_∗, which indicates GALFIT fitting failures for each band, where the asterisk ∗ is a placeholder for the band. GALPLAN_∗ = 0 indicates a severe failure when fitting the surface brightness profile of the galaxy, which could not be amended by attempting a number of correction strategies. We therefore discard all samples where GALPLAN_∗ = 0.

An additional goodness-of-fit parameter that allows the quality of profile fitting to be judged is the PSFNUM_∗ parameter. This parameter indicates the number of prototype stars used to model the point spread function (PSF) in the galaxy image to which the surface brightness profile was fit. As indicated in the GAMA catalogue description, modelling PSFs based on fewer than 10 stars may result in poor PSF models, which in turn may result in poorly fitted surface brightness distributions. Accordingly, we discard all samples where the PSFNUM_∗ parameters have a value lower than 10.
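The two quality cuts can be expressed as a simple per-sample predicate. The following sketch is not the original preprocessing code: the function name, the two-band example and all values are hypothetical, and it assumes that a sample is discarded as soon as any one of the considered bands fails a cut.

```python
def passes_quality_cuts(row, bands, min_psf_stars=10):
    """Return True if the sample survives both cuts in every listed band."""
    for band in bands:
        if row[f"GALPLAN_{band}"] == 0:            # severe GALFIT failure
            return False
        if row[f"PSFNUM_{band}"] < min_psf_stars:  # PSF modelled from too few stars
            return False
    return True

# Hypothetical two-band records; all values are made up for illustration.
bands = ["J", "K"]
samples = [
    {"id": "a", "GALPLAN_J": 1, "PSFNUM_J": 25, "GALPLAN_K": 2, "PSFNUM_K": 14},
    {"id": "b", "GALPLAN_J": 0, "PSFNUM_J": 30, "GALPLAN_K": 1, "PSFNUM_K": 18},
    {"id": "c", "GALPLAN_J": 1, "PSFNUM_J": 9, "GALPLAN_K": 1, "PSFNUM_K": 12},
    {"id": "d", "GALPLAN_J": 3, "PSFNUM_J": 10, "GALPLAN_K": 1, "PSFNUM_K": 10},
]
kept_ids = [s["id"] for s in samples if passes_quality_cuts(s, bands)]
```

Sample "b" is rejected by the GALPLAN cut and sample "c" by the PSFNUM cut, while a PSFNUM value of exactly 10 is retained.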

The catalogue further contains meta-information needed to reproduce the results of the GALFIT fitting. Here we concentrate on parameters that are descriptors of galaxies as opposed to parameters describing the fitting procedure. The galaxy descriptors, all GALFIT-derived, are: GALMAG_∗, the magnitude of the Sérsic model; GALRE_∗, the half-light radius measured along the semi-major axis; GALINDEX_∗, the Sérsic index; GALELLIP_∗, the ellipticity; GALMAGERR_∗, the error on magnitude; GALREERR_∗, the error on the half-light radius; GALINDEXERR_∗, the error on the Sérsic index; GALELLIPERR_∗, the error on ellipticity; GALMAG10RE_∗, the magnitude of a model truncated at 10 × the half-light radius; GALMU0_∗, the central surface brightness; GALMUE_∗, the effective surface brightness at the half-light radius; GALMUEAVG_∗, the effective surface brightness within the half-light radius; and GALR90_∗, the radius containing 90% of total light, measured along the semi-major axis of the galaxy.

2.4.1. SersicCatVIKING

The SersicCatVIKING [21] catalogue contains the above measurements for the VIKING bands Z, Y, J, H, and K. Based on the GALFIT failure parameter GALPLAN_∗ = 0, 966 samples were removed from the sub-catalogue. An additional 1074 samples were removed because of PSFNUM_∗ < 10. After removing samples which have missing values in any of the named feature dimensions, the final sub-catalogue contains 5476 galaxies with 66 Sérsic features.

2.4.2. SersicCatUKIDSS

The SersicCatUKIDSS [21] catalogue contains the above measurements for the UKIDSS [23] bands Y, J, H, K. Based on the GALFIT failure parameter GALPLAN_∗ = 0, 2904 samples were removed from the sub-catalogue. An additional 1841 samples were removed because of PSFNUM_∗ < 10. After removing samples which have missing values in any of the feature dimensions, the final sub-catalogue contains 3008 samples with 53 Sérsic features.

2.4.3. SersicCatSDSS

For the SersicCatSDSS catalogue [21], most samples from the cross-referenced catalogue [3,9] are discarded based on the PSFNUM and GALPLAN selection, and only 1672 samples remain. The SersicCatSDSS catalogue is therefore excluded from the analysis.

2.5. Classification scheme

For each galaxy analysed in our ESANN contribution [3], a class label has been determined by astronomers following a visual inspection based classification scheme described by Kelvin et al. [4]. The scheme assigns galaxies to 9 classes: Ellipticals, Little Blue Spheroids, Early-type spirals, Early-type barred spirals, Intermediate-type spirals, Intermediate-type barred spirals, Late-type spirals & Irregulars, Artefacts and Stars (Table 2). We will refer to the classes by their class index (1–9).


Table 2

Overview of galaxy classes in the dataset used to cross-reference the catalogues analysed in this paper. Shown are also the corresponding Hubble types, an established galaxy type descriptor in astronomy, and the class index that is used to identify classes in the remainder of the work. An asterisk marks the classes that are part of the final classification problems.

| Class index | Class name | Corresponding Hubble type | Prevalence in data set of [3,9] |
|---|---|---|---|
| 1* | Ellipticals | E0-E6 | 11% |
| 2* | Little blue Spheroids | – | 11% |
| 3* | Early-type spirals | S0, Sa | 10% |
| 4 | Early-type barred spirals | SB0, SBa | 1% |
| 5* | Intermediate-type spirals | Sab, Scd | 15% |
| 6 | Intermediate-type barred spirals | SBab, SBcd | 2% |
| 7* | Late-type spirals and irregulars | Sd - Irr | 45% |
| 8 | Artefacts | – | 0.4% |
| 9 | Stars | – | 0.005% |

As barred spirals, artefacts and stars are highly under-represented in this sample, our subsequent analysis will focus on the substantial classes, namely classes 1, 2, 3, 5 and 7.

2.6. Sample selection

To ensure a fair comparison between the catalogues, our final dataset comprises the subsample of galaxies for which a full set of measurements is available, i.e. galaxies for which measurements are provided in each of the five considered catalogues. This is the case for 2117 galaxies. Considering only the substantial classes 1, 2, 3, 5 and 7, and balancing classes so that for each class the same number of samples is selected (259, based on class 2, the class with minimum cardinality), results in a final sample of 1295 galaxies.
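Balancing to the smallest class amounts to drawing, without replacement, the same number of samples from each considered class. The sketch below illustrates this with NumPy on toy labels; the function name and the random seed are hypothetical, and the random draw is an assumption, since the text does not state how the 259 samples per class were chosen.

```python
import numpy as np

def balance_classes(labels, classes, rng):
    """Return sorted indices of a class-balanced random subsample."""
    labels = np.asarray(labels)
    # The smallest considered class sets the number of samples per class.
    n_per_class = min(int((labels == c).sum()) for c in classes)
    chosen = [
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(chosen)), n_per_class

# Toy labels; class 4 stands in for an under-represented class that is ignored.
rng = np.random.default_rng(0)
labels = np.array([1] * 5 + [2] * 3 + [3] * 4 + [4] * 2 + [5] * 6 + [7] * 7)
idx, n = balance_classes(labels, classes=[1, 2, 3, 5, 7], rng=rng)
```

With five classes and 259 samples in the smallest one, the same procedure yields 5 × 259 = 1295 samples, the size of the final dataset.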

3. Methods: classifiers

3.1. GMLVQ

Generalized Relevance Matrix LVQ (GMLVQ) [7,8] is an extension of Learning Vector Quantization (LVQ) [24]. LVQ is a supervised prototype-based method, in which prototypes are annotated with a class label. The prototypes are adapted based on the label information of the training data: if the best-matching unit (BMU), the prototype closest to the data point, is of the same class as a given data point, the prototype is moved towards the data point, while in the case of a BMU with an incorrect class label, the prototype is repelled. While LVQ assesses similarities between prototypes and data points using the Euclidean distance, GMLVQ learns a distance measure that is tailored to the data, allowing it to suppress noisy feature dimensions or to emphasise distinctive features and their pair-wise combinations. GMLVQ therefore considers a generalised distance

$$d(w, \xi) = (\xi - w)^T \Lambda \, (\xi - w) \quad \text{with} \quad \Lambda = \Omega^T \Omega \quad \text{and} \quad \sum_i \Lambda_{ii} = 1,$$

where $\Lambda$ is an $n \times n$ positive semi-definite matrix, $\xi \in \mathbb{R}^n$ represents a feature vector and $w \in \mathbb{R}^n$ is one of $M$ prototypes. After optimisation, the diagonal of $\Lambda$ will encode the learned relevance of the feature dimensions, while the off-diagonal elements encode the relevances of pair-wise feature combinations. As empirically observed and theoretically studied [25,26], the relevance matrix after training is typically low rank and can be used, for instance, for visualisation of the data set (see Appendix A for an example).
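To make the role of the relevance matrix concrete, the following NumPy sketch evaluates the generalised distance for a given matrix Ω. It is an illustration, not the implementation of [27]; the function names are hypothetical, and the trace normalisation is applied explicitly after forming Ω^T Ω.

```python
import numpy as np

def relevance_matrix(omega):
    """Lambda = Omega^T Omega, rescaled so that its trace equals 1."""
    lam = omega.T @ omega
    return lam / np.trace(lam)

def gmlvq_distance(w, xi, lam):
    """Generalised squared distance (xi - w)^T Lambda (xi - w)."""
    diff = xi - w
    return float(diff @ lam @ diff)

w = np.array([0.0, 0.0])
xi = np.array([3.0, 4.0])

# Identity Omega: both dimensions equally relevant (scaled Euclidean distance).
lam_equal = relevance_matrix(np.eye(2))
# Omega that only passes the first dimension: the second one is suppressed.
lam_first = relevance_matrix(np.array([[2.0, 0.0], [0.0, 0.0]]))

d_equal = gmlvq_distance(w, xi, lam_equal)
d_first = gmlvq_distance(w, xi, lam_first)
```

In the second case the diagonal of the relevance matrix is (1, 0), so the difference in the second feature no longer contributes to the distance.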

The parameters $\{w_i\}_{i=1}^{M}$ and $\Omega$ are optimised based on a heuristic cost function, see [7],

$$E_{\mathrm{GMLVQ}} = \sum_{i=1}^{P} \mu_i, \quad \text{with} \quad \mu_i = \frac{d_J(\xi_i) - d_K(\xi_i)}{d_J(\xi_i) + d_K(\xi_i)}, \tag{1}$$

where $P$ refers to the number of training samples, $d_J(\xi) = d(w_J, \xi)$ denotes the distance to the closest correctly labelled prototype $w_J$, and $d_K(\xi) = d(w_K, \xi)$ denotes the distance to the closest incorrect prototype $w_K$. If the closest prototype has an incorrect label, $d_K(\xi_i)$ will be smaller than $d_J(\xi_i)$, hence the corresponding $\mu_i$ is positive. Minimisation of $E_{\mathrm{GMLVQ}}$ will therefore favour the correctness of nearest prototype classification. In a stochastic gradient descent procedure based on a single example the update reads

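$$w_{J,K} \leftarrow w_{J,K} - \eta_w \, \frac{\partial \mu_i}{\partial w_{J,K}} \quad \text{and} \quad \Omega \leftarrow \Omega - \eta_\Omega \, \frac{\partial \mu_i}{\partial \Omega}. \tag{2}$$

Derivations and full update rules can be found in [7]. In a batch gradient descent version [27], updates of the form (2) are summed over all training samples.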
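The cost function of Eq. (1) can be evaluated directly once prototypes and a relevance matrix are given. The sketch below does only that; it does not implement the gradient updates of Eq. (2), for which we refer to [7]. Function and variable names are hypothetical, and the toy prototypes are chosen so that no sample has zero distance to both its closest correct and closest incorrect prototype.

```python
import numpy as np

def gmlvq_cost(X, y, prototypes, proto_labels, lam):
    """Sum of mu_i = (d_J - d_K) / (d_J + d_K) over all samples, cf. Eq. (1)."""
    total = 0.0
    for xi, label in zip(X, y):
        diffs = xi - prototypes                      # shape (M, n)
        # Generalised distance of xi to every prototype.
        d = np.einsum("mi,ij,mj->m", diffs, lam, diffs)
        d_J = d[proto_labels == label].min()         # closest correct prototype
        d_K = d[proto_labels != label].min()         # closest incorrect prototype
        total += (d_J - d_K) / (d_J + d_K)
    return total

prototypes = np.array([[0.0, 0.0], [2.0, 0.0]])
proto_labels = np.array([0, 1])
lam = np.eye(2) / 2.0                                # trace-normalised identity

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.5, 0.0]])
y = np.array([0, 1, 1])                              # third sample lies near the wrong prototype
cost = gmlvq_cost(X, y, prototypes, proto_labels, lam)
```

The first two samples coincide with their correct prototypes and contribute −1 each, while the third sample is closer to the wrong prototype and contributes a positive term of 0.8.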

3.2. Random Forests

Random Forests (RF) [10] is a well-known classification and regression method that employs an ensemble of randomised Decision Trees [28]. In randomised Decision Trees, a subset of features is chosen randomly at each node. Considering only the selected features, decision thresholds are determined based on the best attainable split between classes. To combine the classifications of each tree in the ensemble, i.e. to determine the output of the Random Forest, different methods can be employed. In the scikit-learn implementation used in our experiments [29,30] the final classification output is obtained by averaging the probabilistic prediction of each tree.

Details on the set-up of the experiments for RF as well as for GMLVQ can be found in Section 4.1.
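The averaging of probabilistic tree outputs can be illustrated without training any trees. The sketch below takes per-tree class probabilities as given (the numbers are made up) and combines them; it mimics the combination rule described above and is not the scikit-learn code itself. The function name and class labels are hypothetical.

```python
import numpy as np

def forest_predict(tree_probas, classes):
    """Average per-tree class probabilities and pick the most probable class.

    tree_probas has shape (n_trees, n_samples, n_classes).
    """
    mean_proba = np.mean(tree_probas, axis=0)
    return np.asarray(classes)[mean_proba.argmax(axis=1)], mean_proba

# Made-up outputs of three trees for two samples and two classes.
tree_probas = np.array([
    [[0.45, 0.55], [0.2, 0.8]],
    [[0.45, 0.55], [0.3, 0.7]],
    [[0.95, 0.05], [0.4, 0.6]],
])
predicted, mean_proba = forest_predict(tree_probas, classes=["LBS", "others"])
```

For the first sample, two of the three trees favour the second class, yet the averaged probabilities favour the first: averaging lets a confident tree outweigh several undecided ones, which a plain majority vote would not.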

4. Experiments

In our experiments, we assess relevances of features and discriminability between classes by training and evaluating GMLVQ for each of the five preprocessed catalogues described in Section 2. As found in previous work [3], class 2, the Little Blue Spheroids (LBS), were particularly well-distinguishable. We perform experiments for both the full 5-class problem, trying to distinguish between galaxy classes 1, 2, 3, 5 and 7 (cf. Table 2), and a 2-class problem in which the LBS are classified against galaxies from the other four classes. In addition to the single catalogue experiments, we also assess feature relevances and discriminability between classes for a concatenation of all catalogues, to account for possible synergies between features from different catalogues.

To allow for interpretation in the light of other classifiers, we perform the same experiments with the widely used Random Forests (RF) classifier [10] as a baseline.


4.1. Setup

We train and evaluate GMLVQ on the galaxy catalogue data using a publicly available implementation [27]. As the GMLVQ cost function is implicitly biased towards classes with larger numbers of samples, we train and evaluate the classifier on size-balanced random subsets of the five classes. For our experiments, we specify one prototype per class and run the algorithm for 100 batch gradient steps with step size adaptation as realised in [27] with default parameter settings. We validate the algorithm by performing a class-balanced repeated random sub-sampling validation (see e.g. [31] for validation methods) for a total of 10 runs. Error measures and relevance profiles shown in the following correspond to averages over the 10 repetitions. For the two-class problems we also obtain and average Receiver Operating Characteristics (ROC) and the corresponding Area under the Curve (AUC) [32].
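A class-balanced repeated random sub-sampling validation can be organised as in the following sketch. It only generates the index splits. The 10% test fraction, the seed and the function name are assumptions made for illustration, since the split proportions are not stated here.

```python
import numpy as np

def balanced_subsampling_splits(y, n_runs=10, test_fraction=0.1, seed=0):
    """Yield (train_idx, test_idx) pairs with the same test count per class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    for _ in range(n_runs):
        test_parts = []
        for c in np.unique(y):
            members = np.flatnonzero(y == c)
            n_test = max(1, int(round(test_fraction * members.size)))
            test_parts.append(rng.choice(members, size=n_test, replace=False))
        test_idx = np.sort(np.concatenate(test_parts))
        # Everything not drawn for testing is used for training in this run.
        train_idx = np.setdiff1d(np.arange(y.size), test_idx)
        yield train_idx, test_idx

# Toy labels: five balanced classes with 20 samples each.
y = np.repeat([1, 2, 3, 5, 7], 20)
splits = list(balanced_subsampling_splits(y))
```

Unlike k-fold cross-validation, the test sets of different runs are drawn independently and may overlap; performance measures are then averaged over the runs.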

4.1.1. Setup LBS vs others

For the two-class problem, we evaluate the classifier on a subset of the full dataset (cf. Section 2.6) containing 515 samples. For this subset, we select all 259 samples from class 2, while the others class is made up of 256 samples, consisting of 64 samples randomly selected from each of the classes 1, 3, 5 and 7. The remaining settings and validation procedure remain identical to the 5-class problem.
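The construction of the two-class subset can be sketched as follows. The label array is a stand-in with the class sizes of the final sample, not the actual data, and the function name and seed are hypothetical.

```python
import numpy as np

def lbs_vs_others_subset(labels, rng, lbs_class=2,
                         other_classes=(1, 3, 5, 7), n_other_each=64):
    """Return sample indices and binary targets (1 = LBS, 0 = others)."""
    labels = np.asarray(labels)
    lbs_idx = np.flatnonzero(labels == lbs_class)        # all LBS samples
    other_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_other_each, replace=False)
        for c in other_classes
    ])
    idx = np.concatenate([lbs_idx, other_idx])
    target = np.concatenate([np.ones(lbs_idx.size, dtype=int),
                             np.zeros(other_idx.size, dtype=int)])
    return idx, target

# Stand-in labels: 259 samples for each of the five considered classes.
labels = np.repeat([1, 2, 3, 5, 7], 259)
idx, target = lbs_vs_others_subset(labels, np.random.default_rng(0))
```

This gives 259 + 4 × 64 = 515 samples, so that the two classes are almost exactly balanced.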

4.1.2. RandomForests

We execute experiments employing Random Forests analogous to the GMLVQ experiments, i.e. the classifier is trained on class-balanced random subsets of the data and validated using repeated random sub-sampling validation. Experiments are performed using a publicly available scikit-learn implementation [29,30] with default settings.

4.2. Classification results based on parameters from individual catalogues

A summary of classification performances for both the 5-class and the 2-class problem can be found in Fig. 1. For the 5-class problem, an overview of confusion matrices (averaged over all validation runs) for each of the catalogues is shown in Fig. 1a; an overview of the average classification accuracies can be found in Fig. 1c in the bottom panel. For the 2-class problem, a comparison of ROC curves and classification accuracies can be found in Fig. 1b and in Fig. 1c in the top right subfigure, respectively. The corresponding average relevance profiles contrasting feature relevances for the 5-class and 2-class problem are shown in the Appendix, in Fig. B.1 (Lambdar catalogue), Fig. B.2 (GaussFitSimple catalogue), Fig. B.3 (SersicCatVIKING catalogue), Fig. B.4 (SersicCatUKIDSS catalogue), and Figs. B.5 and B.6 (MagPhys catalogue).

Results based on SersicCatVIKING: The confusion matrix indicating the GMLVQ class-wise accuracy on the SersicCatVIKING catalogue exhibits similar, albeit slightly worse performance than the performances presented in our previous work [3] that was based on a different set of galaxy parameters. Based on the SersicCatVIKING, the LBS are classified with higher accuracy (87% vs. 91% in ESANN) than the other classes (47–67% vs. 64–74% in ESANN). As in the ESANN results, classes 1 and 3 show some overlap (21% of class 1 samples are classified as class 3, and 20% of class 3 samples are erroneously classified as class 1). However, unlike in the ESANN results, the overlap between class 1 and class 2 is increased in the classification using SersicCatVIKING: 22% of class 1 samples are now classified as belonging to class 2, where this overlap was only 10% for the data analysed in our ESANN contribution [3]. This is also reflected in the 2-class problem when distinguishing the LBS from the other classes. In [3] this can be achieved with AUC(ROC) = 0.96, while for the SersicCatVIKING catalogue the classification accuracy is around 84% and the AUC(ROC) = 0.91. Another notable increase in overlap is the overlap between class 5 and 7, where the misclassification rate of class 5 galaxies as class 7 galaxies is increased from 8% to 18%.

Results based on GaussFitSimple catalogue: The confusion matrix for the classification based on the GaussFitSimple catalogue shows the highest classification accuracy of 64% for the LBS. Class 3 drops in accuracy to 47%. This is in part due to an increased overlap between the classes: 31% of class 1 samples are classified as class 3 samples and 31% of class 3 samples as belonging to class 1. In addition, there is increased overlap between class 1 and 5 (12%) and class 3 and 5 (18%), while the overlap between classes 1 and 3 with both LBS and class 7 remains low. It is notable that based on the information in the GaussFitSimple catalogue, class 7 is only classified slightly above chance level, with most of its samples being misclassified as class 2 (35%) and class 5 (18%). Despite this, the distinction between LBS and others is still on average 78% correct, with an AUC(ROC) = 0.81.

Results based on SersicCatUKIDSS: The results for the SersicCatUKIDSS show an overall similar performance to the results of the SersicCatVIKING catalogue: In comparison to the classification performance presented in our ESANN contribution [3], there is an increased misclassification of class 1 samples as class 2 samples, and an increased misclassification of class 5 samples as belonging to class 7. LBS classification accuracy is at 87% with an AUC(ROC) = 0.91.

Results based on Lambdar catalogue: The results for the Lambdar sample show a similar picture as the GaussFitSimple sample: Class 7 is classified with an accuracy of only slightly above chance level and is often (52%) misclassified as class 2. Unlike in the GFS results, the accuracy for class 1 is below chance level (15%). As has been the case for the other catalogues, class 1 samples are misclassified mostly as class 3 (38%). In contrast to the GaussFitSimple catalogue, here class 1 also shows considerable overlap with class 2 (23% of class 1 samples are misclassified as class 2). In addition, a considerable amount of class 1 samples (11% and 13%) are also misclassified as classes 5 and 7. Further, class 5 and class 3 show overlap, with 15–16% misclassifications. Overall, classification accuracy based on the Lambdar catalogue is lowest (46%), while the LBS can be distinguished with 74% accuracy and an AUC(ROC) = 0.81.

Results based on MagPhys catalogue: The classification results for the MagPhys sample show a similar trend as the results based on the Lambdar sample: Classes 1 and 3 exhibit considerable overlap (40% of class 1 samples are classified as class 3, and 17% of class 3 samples are classified as class 1), class 7 accuracy is low (43%) and class 7 is frequently misclassified as class 2 (34% of the cases). In contrast to the Lambdar sample, there is almost no overlap between class 1 and class 2. Average classification accuracy for the 5 classes based on the MagPhys catalogue is at 54%, while the LBS can be distinguished with 80% accuracy and an AUC(ROC) = 0.88.

LBS vs other: The LBS can be distinguished from the other classes with an intermediate accuracy of about 74–87% and AUC(ROC) values of 0.81–0.91.
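For reference, the AUC(ROC) values quoted above can be read as the probability that a randomly chosen LBS sample receives a higher classifier score than a randomly chosen sample from the other classes. The following sketch computes the AUC through this pairwise interpretation on made-up scores; it is not the evaluation code used in the experiments, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores: label 1 stands for LBS, label 0 for the others class.
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)
```

Here five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. A value of 0.5 corresponds to chance level and 1.0 to perfect separation.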

4.3. Combined catalogues

Combining all catalogues would result in a very high-dimensional classification problem, thereby rendering the resulting relevance profiles difficult to interpret. We therefore select a subset of parameters from each individual catalogue based on the feature relevances obtained in the single catalogue experiments in the following manner: For each individual catalogue, parameters are sorted according to their relevance. Subsequently, the most relevant parameters cumulatively comprising 50% of the summed total relevance are carried over to the combined catalogue. We note that we have also performed GMLVQ experiments on the full catalogue comprising all 377 features, which resulted in similar, albeit slightly worse performances than reported below.
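The relevance-based selection can be sketched as follows. The relevance values are made up, the function name is hypothetical, and the sketch assumes that the selection stops at the smallest set of top-ranked parameters whose cumulative relevance reaches the 50% mark.

```python
import numpy as np

def select_by_cumulative_relevance(relevances, fraction=0.5):
    """Indices of the most relevant features reaching `fraction` of the total."""
    relevances = np.asarray(relevances, dtype=float)
    order = np.argsort(relevances)[::-1]                 # most relevant first
    cumulative = np.cumsum(relevances[order]) / relevances.sum()
    # Smallest number of top features whose cumulative share reaches `fraction`.
    n_keep = int(np.searchsorted(cumulative, fraction) + 1)
    return order[:n_keep]

# Made-up relevance profile of a six-feature catalogue.
relevances = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
selected = select_by_cumulative_relevance(relevances)
```

In this toy profile the two most relevant features together account for 55% of the total relevance, so only these two would be carried over to the combined catalogue.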

For the Random Forests baseline experiments, we select the full catalogue of 377 features independent from the GMLVQ results, so as to warrant identical experimental conditions. For completeness, we note that classification accuracy of Random Forests on the above described relevance-selected parameter subset is comparable to the classification accuracy on the full dataset.

Sorted relevance profiles for the resulting combined catalogues are displayed in Fig. 2a and b, for the 5-class and 2-class problem, respectively. To simplify comparison, the confusion matrix as well as the 2-class classification performance are displayed alongside the individual catalogue performances in Fig. 1.

Considering the confusion matrix for the combined catalogue, a slight overall increase in performance with respect to the individual catalogue performances can be observed. Further, it reflects the combined properties of the individual catalogues: An overlap between classes 1 and 3, some overlap between classes 3 and 5, and some overlap between classes 2 and 7. In comparison to the results presented in [3], classification accuracy is slightly decreased (70% vs. 73%). It should be noted, however, that in [3] thrice as many samples per class were available, which could account for the difference in performance. LBS can be distinguished from the other classes with a classification accuracy of 89% and an AUC(ROC) = 0.96.

Feature relevances for the combined catalogues: The parameters that make up 50% of the relevances for the 5-class and the 2-class problem (indicated by a black arrow in Fig. 2a and b) almost exclusively originate from the SersicCatVIKING and MagPhys catalogues. For the 5-class problem, these parameters are related to stellar masses and dust (mass_stellar_best_fit, mass_dust_percentile97_5, mass_stellar_percentile_97_5 and mass_stellar_percentile84), the star formation timescale (gama_percentile16), the effective surface brightness within the half-light radius for the J- and Z-bands (GALMUEAVG_J and GALMUEAVG_Z), ellipticity of the galaxy (GALELLIP_Z, GALELLIP_Yviking), and magnitude of a GALFIT model of the galaxy (GALMAG10RE_Jviking).

For the 2-class problem, the most relevant parameters encompass the GALFIT central surface brightness in Z-band (GALMU0_Z), parameters related to star formation rates (sfr19_percentile50), information related to the ellipticity of the galaxies (GALELLIPERR_Z, GALELLIP_Hviking), effective surface brightness (GALMUEAVG_Z) and information about the equivalent width of the sulphur emission line.

It should be noted that relevance matrices are not necessarily unique. They depend on which other features are available and on the parameters chosen for both data preprocessing and execution of the algorithm. This can be illustrated when considering highly correlated variables: GMLVQ might either assign intermediate relevances to both of the variables, or deem one variable highly relevant at the expense of the other correlated variable’s relevance. Relevance profiles therefore should be interpreted in the sense that focusing on the most relevant parameters would allow differentiation between classes with the reported accuracy, while keeping in mind that other combinations of features may achieve this as well.
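The non-uniqueness in the presence of correlated variables can be made concrete with a toy sketch. It assumes the standard GMLVQ distance d(x, w) = (x − w)ᵀ ΩᵀΩ (x − w) of [7] with relevances given by the diagonal of ΩᵀΩ; the data, prototypes and function names below are hypothetical, and no training is performed: two fixed matrices Ω are simply compared.

```python
def gmlvq_distance(x, w, omega):
    """Squared adaptive distance (x - w)^T Omega^T Omega (x - w)."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    projected = [sum(o * d for o, d in zip(row, diff)) for row in omega]
    return sum(p * p for p in projected)


def relevances(omega):
    """Diagonal of Lambda = Omega^T Omega, i.e. the relevance profile."""
    n_features = len(omega[0])
    return [sum(row[j] ** 2 for row in omega) for j in range(n_features)]


def nearest_prototype(x, prototypes, omega):
    """Label of the prototype closest to x under the adaptive distance."""
    return min(prototypes, key=lambda label: gmlvq_distance(x, prototypes[label], omega))


# Toy data (hypothetical): the second feature is an exact copy of the first.
prototypes = {"A": [0.0, 0.0], "B": [2.0, 2.0]}
samples = [[0.3, 0.3], [0.9, 0.9], [1.2, 1.2], [1.8, 1.8]]

# Two solutions with trace(Lambda) = 1: all relevance on feature 1,
# or the relevance shared equally between the two correlated features.
omega_one = [[1.0, 0.0]]
s = 0.5 ** 0.5
omega_shared = [[s, s]]

labels_one = [nearest_prototype(x, prototypes, omega_one) for x in samples]
labels_shared = [nearest_prototype(x, prototypes, omega_shared) for x in samples]

print(relevances(omega_one), labels_one)
print(relevances(omega_shared), labels_shared)
```

Both matrices assign every sample to the same prototype, since on perfectly correlated features the two distances differ only by a constant factor, yet one profile declares the second feature irrelevant while the other rates both features equally.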

4.4. Random Forests baseline results

The classification accuracies for Random Forests for the individual and combined catalogues are displayed in Fig. 1c side-by-side with the GMLVQ results. For all catalogues, applying the Random Forest classifier results in comparable, though slightly better classification accuracies.

5. Discussion and conclusion

The results presented above suggest that there may be inconsistencies in the investigated morphological classification scheme: Analogous to our previous findings [3], it has proven difficult to distinguish galaxy types using two powerful and flexible classifiers, GMLVQ and Random Forests. In all GMLVQ analyses of the individual as well as of the combined catalogues, class 1 (Ellipticals) and 3 (Early-type spirals) are particularly difficult to differentiate. Class 7 (Late-type spirals & Irregulars) is frequently misclassified as class 5 (Intermediate-type spirals) and with a similar frequency as class 2 (LBS), while class 2 is consistently detected with the highest sensitivity among all classes.

The difficulty of training a successful classifier was also observed in [9], where class-wise averaged accuracies are around 75%. As mentioned in our earlier contribution [3], possible explanations for poor classification performance may be the lack of discriminative power of the employed classifiers or mis-labellings of certain galaxies [9]. A possible indication for the latter case may be that samples from class 7 (Late-type spirals & Irregulars) are often misclassified as class 5 (Intermediate-type spirals), and class 2 (LBS). This indicates that the feature representations of the galaxies in question share more properties with the named classes, and it is not unlikely that in the hand-labelling process an Intermediate-type spiral is occasionally misclassified as class 7 (e.g. confused with a Late-type spiral), or that a LBS is classified as class 7 (an Irregular). In the former case (insufficient discriminative power), employing even more flexible classifiers, e.g. GMLVQ with local relevance matrices [7], may improve classification performances. In the second case, if mis-labellings are restricted to “neighbouring” classes in an assumed underlying class ordering (e.g. when considering class 5 adjacent to class 7, or class 1 (Ellipticals) as adjacent to class 3 (Early-type spirals)), ordinal classification may provide further insights [33,34].

Despite trying to address the issue of essential parameters not being contained in the dataset analysed in [3] by considering 5 additional catalogues with a multitude of photometric, spectroscopic and morphological measurements, it is still possible that additional (and possibly not yet discovered) parameters would enable improved class distinction. Yet, our results do not rule out the possibility that the true, underlying grouping of galaxies is considerably different and less clear-cut than the investigated one. Further data-driven analyses of galaxy parameters and images with advanced clustering methods might reveal alternative groupings, as recently found for data in the VIMOS Public Extragalactic Redshift Survey [35], or even suggest novel classification schemes.

To aid further insight into the nature of the employed visual-based classification scheme, in particular with respect to physical parameters, we have presented relevances of the catalogue features for the investigated class distinctions. Note that relevances have to be interpreted with regard to the characteristics of the data sample (e.g. correlations) and classification performance. This implies that feature relevances are only meaningful when the class of interest is at least moderately well distinguished from the others. Further, it should be noted that the presented feature relevances are not necessarily unique; alternative relevance solutions may exist. It is of particular interest to note that in the combined catalogue the most relevant features originate from the Sérsic catalogues and the MagPhys catalogue. The high relevance of Sérsic features indicates the importance of galaxy structure in different bands for the class distinction, while the presence of highly relevant features from the MagPhys catalogue highlights that classification performance is aided by these physical parameters as well. Further insight into the role of features in the context of necessary and dispensable features may be obtained by studying feature relevance bounds along the lines of [36].

Fig. 2. Sorted relevance profiles for catalogues obtained by combining the most relevant features that cumulatively make up 50% of the relevances in the single catalogue relevance profiles. Bar colours indicate the origin catalogue for each feature. Features up to the position marked by a black arrow constitute 50% of the cumulative relevance determined for the resulting combined catalogue.

Conclusions: We have presented an analysis of five galaxy catalogues using Random Forests and GMLVQ, a prototype-based classifier. Analogous to results obtained in preceding work on a lower-dimensional dataset, we conclude that even when considering a multitude of additional galaxy descriptors, the visual-based classification scheme used to label the galaxy sample remains not fully supported by the available data. Taking into account that perceptual and conceptual biases likely play non-negligible roles in the creation and application of galaxy classification schemes, further data-driven analyses might help provide novel insights regarding the true underlying grouping of galaxies.

Acknowledgements

GAMA is a joint European-Australasian project based around a spectroscopic campaign using the Anglo-Australian Telescope. The GAMA input catalogue is based on data taken from the Sloan Digital Sky Survey and the UKIRT Infrared Deep Sky Survey. Complementary imaging of the GAMA regions is being obtained by a number of independent survey programmes including GALEX MIS, VST KiDS, VISTA VIKING, WISE, Herschel-ATLAS, GMRT and ASKAP providing UV to radio coverage. GAMA is funded by the STFC (UK), the ARC (Australia), the AAO, and the participating institutions. The GAMA website is http://www.gama-survey.org/.

We thank Sreevarsha Sreejith, Lee Kelvin and Angus Wright for helpful feedback and discussions and the anonymous reviewers for feedback which helped us improve the manuscript.

A. Nolte and M. Biehl acknowledge financial support by the EU's Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No. 721463 to the SUNDIAL ITN network. M. Bilicki is supported by the Netherlands Organisation for Scientific Research, NWO, through grant number 614.001.451 and by the Polish Ministry of Science and Higher Education through grant DIR/WK/2018/12.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.neucom.2018.12.076.

Appendix A. Dataset visualisations and intrinsic dimensionality reduction in GMLVQ

Figs. A.1 and A.2 display projections of each dataset considered in this work onto the first and second eigenvector of the relevance matrix Λ (cf. Section 3) and onto the first two principal components determined by Principal Component Analysis (PCA) [37]. The rightmost column of each figure contrasts the eigenvalue spectra of Λ and the data covariance matrix which forms the basis for PCA. While Λ is an n × n matrix, the steeply declining eigenvalue spectra for each dataset illustrate the low-dimensional subspace which GMLVQ operates in after learning [25,26]. In particular, for the 5 class problem, Λ spans an approximately 3 dimensional subspace, while for the 2 class problem the subspace is essentially one-dimensional. The low-rank relevance matrices therefore can be thought of as performing a GMLVQ-intrinsic dimensionality reduction.
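The intrinsic dimensionality reduction can be written out explicitly. The following is a sketch assuming the standard GMLVQ parameterisation of [7], in which the relevance matrix is Λ = ΩᵀΩ and the distance between a sample x and a prototype w is quadratic in Λ; the symbols Ω, λ_i, u_i, k and y are notation introduced here for illustration.

```latex
\begin{aligned}
\Lambda &= \Omega^{\top}\Omega = \sum_{i=1}^{n} \lambda_i \, u_i u_i^{\top},
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0, \\
d_{\Lambda}(x, w) &= (x - w)^{\top} \Lambda \, (x - w)
  = \sum_{i=1}^{n} \lambda_i \bigl( u_i^{\top} (x - w) \bigr)^{2}
  \approx \sum_{i=1}^{k} \lambda_i \bigl( u_i^{\top} (x - w) \bigr)^{2}, \\
y(x) &= \bigl( \sqrt{\lambda_1}\, u_1^{\top} x, \; \sqrt{\lambda_2}\, u_2^{\top} x \bigr).
\end{aligned}
```

When the eigenvalues beyond the k-th are close to zero, only the leading k eigendirections contribute to the distance, with k ≈ 3 for the 5 class problem and k ≈ 1 for the 2 class problem according to the spectra described above. The map y(x) corresponds to the 2-D projections onto the two leading eigenvectors, scaled by the square roots of the corresponding eigenvalues as stated in the captions of Figs. A.1 and A.2.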

Comparing the 2-D projections onto the two leading eigenvectors of Λ and the projections onto the first two principal components, the former results in a more fanned out representation with respect to the classes. This is due to the fact that by making use of the class labels, GMLVQ finds a lower-dimensional discriminative subspace as opposed to the unsupervised PCA.

Appendix B. Feature relevances for individual catalogues

In the following (Figs. B.1–B.6), we present relevance profiles for the individual catalogues analysed in this work. Relevance profiles reflect the diagonal of GMLVQ’s relevance matrix Λ after learning (cf. Section 3) and summarise the importance of features for a given data sample and classification task. Figures display mean and variance of the profiles over 10 independent runs (cf. Section 4.1). As noted previously, for an accurate interpretation it is important to note that, in general, relevance profiles are not unique: Especially in the presence of correlated variables, alternative profiles resulting in comparable classification performance might exist. In particular, a feature’s low relevance does not entail that the feature carries no information for the desired class distinction, but may instead indicate that its contribution is at least partly redundant with other features.

For example, contrary to expectations at first glance, our experiments with the Lambdar sample result in relevance profiles that indicate uncertainties of fluxes of various bands as more relevant than the corresponding flux measurements themselves (Fig. B.1). While it is not unthinkable that flux uncertainties systematically vary over a subset of galaxy classes (personal communication, Angus Wright, developer of the LAMBDAR software), in our sample W1 and W2 fluxes are correlated with both their respective errors and with fluxes from other bands. W1 and W2 fluxes as well as fluxes from other bands are thus at least partly redundant with the W1 and W2 flux uncertainties, and the uncertainties therefore might end up more relevant than the corresponding fluxes.

Fig. A.1. 2D visualisations of the datasets used in the LBS vs. others classification condition. The leftmost column displays a projection of each dataset onto the first two eigenvectors of the learned relevance matrix Λ. In the middle column, projections of the datasets onto the first two principal components (PC1 and PC2) are shown. The right column juxtaposes the eigenvalue spectra of the relevance matrix and the data covariance matrix used in PCA. For increased readability, figures concentrate on the median region of the data and axes are cut off at a 3 times inter-quantile range distance from the median. Furthermore, the data projections are scaled by the square root of the corresponding eigenvalues. In the sub-figures of the eigenvalue spectra the x-axis is truncated after both eigenvalues have dropped below a value of 0.005.

Fig. A.2. 2D visualisations of the datasets used in the 5-class classification condition. The leftmost column displays a projection of each dataset onto the first two eigenvectors of the learned relevance matrix Λ. In the middle column, projections of the datasets onto the first two principal components (PC1 and PC2) are shown. The right column juxtaposes the eigenvalue spectra of the relevance matrix and the data covariance matrix used in PCA. For increased readability, figures concentrate on the median region of the data and axes are cut off at a 3 times inter-quantile range distance from the median. Furthermore, the data projections are scaled by the square root of the corresponding eigenvalues. In the sub-figures of the eigenvalue spectra the x-axis is truncated after both eigenvalues have dropped below a value of 0.005.

Fig. B.1. Feature relevances as determined by GMLVQ for the Lambdar sample. For accurate interpretation of the relevance profiles, note that relevance profiles are not necessarily unique, in particular in the presence of highly correlated variables. This implies that focusing on the relevant parameters would enable differentiation between classes with the reported accuracy; however, there may be other combinations of features which could result in similar accuracies.


References

[1] R.J. Buta, Galaxy morphology, in: T.D. Oswalt, W.C. Keel (Eds.), Planets, Stars and Stellar Systems: Volume 6: Extragalactic Astronomy and Cosmology, Springer Netherlands, Dordrecht, 2013, pp. 1–89, doi: 10.1007/978-94-007-5609-0_1.

[2] H. Mo, F. Van den Bosch, S. White, Galaxy Formation and Evolution, Cambridge University Press, 2010.

[3] A. Nolte, L. Wang, M. Biehl, Prototype-based analysis of GAMA galaxy catalogue data, in: M. Verleysen (Ed.), Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Ciaco, 2018, pp. 339–344. i6doc.com

[4] L. Kelvin, S. Driver, A.S. Robotham, A.W. Graham, S. Phillipps, N.K. Agius, M. Alpaslan, I. Baldry, S. Bamford, J. Bland-Hawthorn, et al., Galaxy And Mass Assembly (GAMA): ugrizYJHK Sérsic luminosity functions and the cosmic spectral energy distribution by Hubble type, Mon. Not. R. Astron. Soc. 439 (2) (2014) 1245–1269.

[5] S.P. Driver, P. Norberg, I.K. Baldry, S.P. Bamford, A.M. Hopkins, J. Liske, J. Loveday, J.A. Peacock, D.T. Hill, L.S. Kelvin, A.S.G. Robotham, N.J.G. Cross, H.R. Parkinson, M. Prescott, C.J. Conselice, L. Dunne, S. Brough, H. Jones, R.G. Sharp, E. van Kampen, S. Oliver, I.G. Roseboom, J. Bland-Hawthorn, S.M. Croom, S. Ellis, E. Cameron, S. Cole, C.S. Frenk, W.J. Couch, A.W. Graham, R. Proctor, R. De Propris, I.F. Doyle, E.M. Edmondson, R.C. Nichol, D. Thomas, S.A. Eales, M.J. Jarvis, K. Kuijken, O. Lahav, B.F. Madore, M. Seibert, M.J. Meyer, L. Staveley-Smith, S. Phillipps, C.C. Popescu, A.E. Sansom, W.J. Sutherland, R.J. Tuffs, S.J. Warren, GAMA: towards a physical understanding of galaxy formation, Astron. Geophys. 50 (5) (2009) 5.12–5.19, doi: 10.1111/j.1468-4004.2009.50512.x.

[6] T. Kohonen, The self-organizing map, Neurocomputing 21 (1) (1998) 1–6.

[7] P. Schneider, M. Biehl, B. Hammer, Adaptive relevance matrices in Learning Vector Quantization, Neural Comput. 21 (12) (2009) 3532–3561.

[8] M. Biehl, B. Hammer, T. Villmann, Prototype-based models in machine learning, Wiley Interdiscip. Rev.: Cogn. Sci. 7 (2) (2016) 92–111.

[9] S. Sreejith, S. Pereverzyev Jr, L.S. Kelvin, F.R. Marleau, M. Haltmeier, J. Ebner, J. Bland-Hawthorn, S.P. Driver, A.W. Graham, B.W. Holwerda, et al., Galaxy and mass assembly: automatic morphological classification of galaxies using statistical learning, Mon. Not. R. Astron. Soc. 474 (4) (2017) 5232–5258.

[10] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.

[11] J. Liske, I. Baldry, S. Driver, R. Tuffs, M. Alpaslan, E. Andrae, S. Brough, M. Cluver, M.W. Grootes, M. Gunawardhana, et al., Galaxy And Mass Assembly (GAMA): end of survey report and data release 2, Mon. Not. R. Astron. Soc. 452 (2) (2015) 2087–2126.

[12] S. Turner, L.S. Kelvin, I.K. Baldry, P.J. Lisboa, S.N. Longmore, C.A. Collins, B.W. Holwerda, A.M. Hopkins, J. Liske, Reproducible k-means clustering in galaxy feature data from the GAMA survey, Mon. Not. R. Astron. Soc. 482 (1) (2018) 126–150.

[13] Y.A. Gordon, M.S. Owers, K.A. Pimbblet, S.M. Croom, M. Alpaslan, I.K. Baldry, S. Brough, M.J. Brown, M.E. Cluver, C.J. Conselice, et al., Galaxy and Mass Assembly (GAMA): active galactic nuclei in pairs of galaxies, Mon. Not. R. Astron. Soc. 465 (3) (2016) 2671–2686.

[14] A. Wright, A. Robotham, N. Bourne, S. Driver, L. Dunne, S. Maddox, M. Alpaslan, S. Andrews, A. Bauer, J. Bland-Hawthorn, et al., Galaxy and mass assembly: accurate panchromatic photometry from optical priors using LAMBDAR, Mon. Not. R. Astron. Soc. 460 (1) (2016) 765–801.

[15] D.G. York, J. Adelman, J.E. Anderson Jr, S.F. Anderson, J. Annis, N.A. Bahcall, J. Bakken, R. Barkhouser, S. Bastian, E. Berman, et al., The Sloan digital sky survey: technical summary, Astron. J. 120 (3) (2000) 1579.

[16] A. Edge, W. Sutherland, K. Kuijken, S. Driver, R. McMahon, S. Eales, J.P. Emerson, The VISTA Kilo-degree Infrared Galaxy (VIKING) Survey: Bridging the Gap between Low and High Redshift, The Messenger 154 (2013) 32–34.

[17] E.L. Wright, P.R. Eisenhardt, A.K. Mainzer, M.E. Ressler, R.M. Cutri, T. Jarrett, J.D. Kirkpatrick, D. Padgett, R.S. McMillan, M. Skrutskie, et al., The Wide-field Infrared Survey Explorer (WISE): mission description and initial on-orbit performance, Astron. J. 140 (6) (2010) 1868.

[18] E. Da Cunha, S. Charlot, D. Elbaz, A simple model to interpret the ultraviolet, optical and infrared emission from galaxies, Mon. Not. R. Astron. Soc. 388 (4) (2008) 1595–1617.

[19] E. da Cunha, S. Charlot, MAGPHYS: Multi-wavelength Analysis of Galaxy Physical Properties, 2011, http://adsabs.harvard.edu/abs/2011ascl.soft06010D.

[20] E. da Cunha, S. Charlot, MAGPHYS - Multi-wavelength Analysis of Galaxy Physical Properties, accessed: 2018-07-10, http://www.iap.fr/magphys/ewExternalFiles/readme.pdf.

[21] L.S. Kelvin, S.P. Driver, A.S. Robotham, D.T. Hill, M. Alpaslan, I.K. Baldry, S.P. Bamford, J. Bland-Hawthorn, S. Brough, A.W. Graham, et al., Galaxy And Mass Assembly (GAMA): structural investigation of galaxies via model analysis, Mon. Not. R. Astron. Soc. 421 (2) (2012) 1007–1039.

[22] C.Y. Peng, L.C. Ho, C.D. Impey, H.-W. Rix, Detailed structural decomposition of galaxy images, Astron. J. 124 (1) (2002) 266.

[23] A. Lawrence, S.J. Warren, O. Almaini, A.C. Edge, N.C. Hambly, R.F. Jameson, P. Lucas, M. Casali, A. Adamson, S. Dye, J.P. Emerson, S. Foucaud, P. Hewett, P. Hirst, S.T. Hodgkin, M.J. Irwin, N. Lodieu, R.G. McMahon, C. Simpson, I. Smail, D. Mortlock, M. Folger, The UKIRT Infrared Deep Sky Survey (UKIDSS), Mon. Not. R. Astron. Soc. 379 (2007) 1599–1617, doi: 10.1111/j.1365-2966.2007.12040.x.

[24] T. Kohonen, Self-organizing Maps, Springer, 1997.

[25] M. Biehl, B. Hammer, F.-M. Schleif, P. Schneider, T. Villmann, Stationarity of matrix relevance LVQ, in: Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), IEEE, 2014, pp. 1–8.

[26] M. Biehl, K. Bunte, F.-M. Schleif, P. Schneider, T. Villmann, Large margin linear discriminative visualization by matrix relevance learning, in: Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012, pp. 1873–1880.

[27] M. Biehl, No-nonsense GMLVQ Demo Code, Version 2.3, accessed: 2018-07-01, http://www.cs.rug.nl/~biehl/gmlvq.

[28] L. Breiman, Classification and Regression Trees, Routledge, 1984.

[29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in python, J. Mach. Learn. Res. 12 (2011) 2825–2830.

[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Random forests in scikit-learn v0.20.0, settings taken from v0.22.0, accessed: 2018-10-15, http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier, http://scikit-learn.org/stable/modules/ensemble.html#random-forests.

[31] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Vol. 1, Springer-Verlag, Springer Series in Statistics, New York, NY, USA, 2001.

[32] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (8) (2006) 861–874.

[33] S. Fouad, P. Tino, Adaptive metric Learning Vector Quantization for ordinal classification, Neural Comput. 24 (11) (2012) 2825–2851.

[34] F. Tang, P. Tiňo, Ordinal regression based on Learning Vector Quantization, Neural Netw. 93 (2017) 76–88.

[35] M. Siudek, K. Małek, A. Pollo, T. Krakowski, A. Iovino, M. Scodeggio, T. Moutard, G. Zamorani, L. Guzzo, B. Garilli, et al., The VIMOS Public Extragalactic Redshift Survey (VIPERS). The complexity of galaxy populations at 0.4 < z < 1.3 revealed with unsupervised machine-learning algorithms, Astron. Astrophys. 617 (2018) A70.

[36] C. Göpfert, L. Pfannschmidt, J.P. Göpfert, B. Hammer, Interpretation of linear classifiers by means of feature relevance bounds, Neurocomputing 298 (2018) 69–79.

[37] I. Jolliffe, Principal component analysis, in: M. Lovric (Ed.), International Encyclopedia of Statistical Science, Springer, Berlin, Heidelberg, 2011, pp. 1094–1096, doi: 10.1007/978-3-642-04898-2_455.

Aleke Nolte is currently a Ph.D. student at the University of Groningen, The Netherlands. She is part of the Marie Skłodowska-Curie Innovative Training Network SUNDIAL, which thematically is situated at the intersection of machine learning and astronomy. Previously, she obtained a joint M.Sc. degree in Computational Neuroscience from Technische Universität Berlin and Humboldt-Universität zu Berlin, Germany, and a B.Sc. degree in Cognitive Science from the University of Osnabrück, Germany.

Lingyu Wang is a scientist at the Netherlands Institute for Space Research (SRON). She is also an assistant professor and Rosalind Franklin fellow at the University of Groningen, the Netherlands. She received her Ph.D. in Astrophysics from Imperial College London in the United Kingdom in 2009. Her research interests include galaxy formation and evolution, statistical analysis and methods, and applying machine learning techniques to large datasets in astronomy.

Maciej Bilicki received his Ph.D. in astrophysics in 2012 from the Nicolaus Copernicus Astronomical Centre in Warsaw, Poland. Earlier he obtained M.Sc. degrees in Mathematics with Computer Science from the University of Lodz (2002) and in astronomy from the Nicolaus Copernicus University in Torun (2007), both in Poland. Between 2012 and 2015 he was a postdoctoral fellow at the Department of Astronomy, University of Cape Town, South Africa. Since 2015 he has been a postdoctoral research assistant at the Leiden University Observatory, and since 2017 he has also held a part-time position at the Astrophysics Division of the National Centre for Nuclear Research in Warsaw. His main research interests are in studying the large-scale structure of the Universe. Among the various methods applied for this purpose are machine-learning techniques, such as supervised learning for regression and classification, and he has led or been involved in several applications of such methodologies to big astronomical data.

Benne Holwerda received his Ph.D. in Astronomy from the University of Groningen, the Netherlands in 2005. He is now an associate professor of astronomy at the University of Louisville, after working at the Space Telescope Science Institute, the University of Cape Town, the European Space Agency and Leiden Observatory. His scientific interests lie in the evolution of galaxies, the role of dust and gas in galaxy evolution and appearance. His interest in machine learning techniques lies in their use to solve astronomical object deblending and the identification of unique objects and populations in large astronomical data.

Michael Biehl received a Ph.D. in Physics from the University of Gießen, Germany, in 1992 and completed a Habilitation in Theoretical Physics at the University of Würzburg, Germany, in 1996. He is currently Professor of Computer Science at the Bernoulli Institute for Mathematics, Computer Science, and Artificial Intelligence at the University of Groningen, The Netherlands. His main research interest is in the theory, development and application of machine learning techniques, with recent emphasis on prototype-based systems and adaptive distance measures. He has co-authored more than 180 publications in international journals and conference proceedings. Further information, re- and preprints etc. are available at
