(1)

University of Groningen

Data Validation Beyond Big Data

Valentijn, Edwin A.

Published in:

VST in the Era of the Large Sky Surveys

DOI:

10.5281/zenodo.1303323

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2018

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Valentijn, E. A. (2018). Data Validation Beyond Big Data. In VST in the Era of the Large Sky Surveys: Proceedings of the conference held 5-8 June, 2018 in Naples, Italy (pp. 17)

https://doi.org/10.5281/zenodo.1303323

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

(2)

Data validation beyond Big Data

Edwin A. Valentijn

Kapteyn Astronomical Institute

(3)

STORY LINES

- processing/archiving/distribution:

  - AstroWISE - KiDS - OU-Ext - Euclid

- data validation:

  - lineage - OU-Ext - Euclid - Facts and Fakes

Sequence of hypes:

(4)

The Datacentric approach

local networks and distributed

• 1-100 Tbyte – Pbytes

• DPU - Distributed Processing Unit

• Distributed Data server - files

• Data manage - database - metadata

2003 RUG-CIT

(5)

KiDS - ESO – OmegaCAM@VST MUSE - ESO - VLT

Lofar - LTA - Astron

Glimps - AI Handwritten text – Lifelines DNA Target Holding

-> Euclid - ESA -> Micado - ESO - ELT

Astro-WISE – Data federations

Distributed Information Systems - handling surveys

since 2003 - it works

(6)

all published

http://www.astro-wise.org
Manuals & tutorials

http://www.rug.nl/target
Target Consortium

Experimental Astronomy - Vol. 35, 2013
All papers are online

Astroinformatics 2016, IAU Symposium 325
Data federations

(7)

KiDS Quality control DR1-DR2-DR3

(8)

Links as workhorse

in data federations

• Distributed Information Systems

– Users, computers, storage

• Processing and Quality control

• Reproducible (re-processing)

2018: Open Science - FAIR principles

Findable Accessible Interoperable Reproducible

The Universe as a spreadsheet ERCIM News 2006

AstroWISE Chaining to the Universe, ADASS XVI, ASP Conference Series, 15-18 October 2006 in Tucson, Arizona, USA.

(9)

The universe as a spreadsheet

Target Diagram/Data lineage /backward chaining

++ programming - dependencies

QUERY / INFORMATION PROCESSING
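The backward-chaining idea behind a target diagram can be illustrated with a minimal Python sketch. This is not the Astro-WISE implementation: the class `Target`, the functions `make` and `lineage`, and the target names `bias`, `flat` and `reduced` are all hypothetical, chosen only to show how a request for a result walks back through its dependencies and how the same links give the data lineage.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Target:
    """A requested data product and the targets it depends on (hypothetical)."""
    name: str
    depends_on: List["Target"] = field(default_factory=list)
    value: Optional[str] = None  # None means "not yet processed"


def make(target: Target, log: List[str]) -> str:
    """Backward chaining: process upstream targets first, and only if missing."""
    if target.value is not None:
        return target.value  # already available, reuse it
    inputs = [make(dep, log) for dep in target.depends_on]
    target.value = f"{target.name}({', '.join(inputs)})"
    log.append(target.name)  # record what was actually processed
    return target.value


def lineage(target: Target) -> List[str]:
    """Follow the dependency links back to the source, upstream targets first."""
    seen = {}

    def visit(t: Target) -> None:
        for dep in t.depends_on:
            visit(dep)
        seen.setdefault(t.name, None)

    visit(target)
    return list(seen)


# Hypothetical three-step chain: reduced depends on bias and flat, flat on bias.
bias = Target("bias")
flat = Target("flat", [bias])
reduced = Target("reduced", [bias, flat])

processed: List[str] = []
make(reduced, processed)   # processes bias, flat, reduced in that order
make(reduced, processed)   # second request reuses the stored result
```

In this sketch the second request does no new processing because every target already holds a value; re-processing after a change would amount to resetting `value` on the affected targets.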

(10)
(11)

Image courtesy of ESA

Euclid

ESA launch in May 2021

Euclid Archive System (EAS)

- data centric information system - many of the WISE concepts

- prototype uses Astro-WISE

- db hosted in the Euclid SDC-NL

(12)

Weak gravitational lensing as probe of dark matter

KiDS: < 100 × 10^6 redshifts

EUCLID: 1.5 × 10^9 redshifts - phot-z

Ground based data – OU-Ext

Every galaxy has its own 4 PSFs

(13)

The strengths

KiDS/VIKING

KiDS

u,g,r,i

(14)

Credits: A. Szalay, T. Tyson #Astronomers

(15)
(16)

Distributed communities

access-process-calibrate-analyse

publish

Euclid:

o 1500 registered members and growing

o 200 laboratories/departments

o 16 countries contributing

(17)

Euclid Archive system – EAS – lay out

Data Processing System (EAS-DPS)

- Integral part of SGS
- Euclid Common Data Model
- Consortium Services

Science Archive System (EAS-SAS)

- Part of ESA archives
- Science Exploration Data Model
- Scientific Exploitation Services

Distributed Storage System (EAS-DSS)

- Storage nodes in each SDC and SOC
- File storage with additional interfaces for Euclid services (cut-out, visualization)
- Data files storage and computing facilities at each node: SOC, SDC-FR, SDC-ES, SDC-NL, SDC-DE, SDC-UK, SDC-IT, SDC-FI

(18)

[Diagram: EAS components across SDC-X, SDC-Y and SDC-Z. Labels: Science Archive System; Data Processing System (HPC); Metadata storage; Metadata Access Layer; Consortium User Service; Consortium Processing Service; Metadata Ingest Service; IAL; DSS server; DSS Storage Node Interface; COORS; Storage nodes; Euclid Data Model]

Euclid data model in XSD

(19)
(20)

Euclid-EXT: massive pixel volumes - distributed archives

- CADC: CFIS
- CEFCA: JEDIS
- IfA: Pan-STARRS
- SDC-NL: KiDS
- SDC-FR: LSST
- SDC-DE: DES

(21)

From KiDS to Euclid-EXT

Euclidization

Changing reference systems Astrometry- photometry

(22)

[Target diagram nodes, with their columns:]

- Stage 1 pipeline
- GaiaList
- SourceLists (expID, SID, xy, ADU, radecinit)
- AssociateLists (ID, expID, SID)
- AstromSolutions (expID, astromparams)
- PhotomSolutions (expID, photomparams)
- CalibratedSourceLists (ID, expID, radec, mag)
- MasterStarTable (ID, radec, motion, SED, variability)

Target diagram ( ++ dependencies) for OU-EXT – Euclid external data - stage 2-dynamic Euclidization

Stage 1: instrument detrending = pixel processing

Stage 2: photom/astrom calibration = table handling

• built-in dynamical reference system: table data-lineage,
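The "table handling" of stage 2 can be sketched as joins between the tables named in the target diagram. The following Python sketch is illustrative only and is not the OU-EXT pipeline. The column names `ID`, `expID`, `SID` and `ADU` come from the slide; everything else is an assumption, in particular that `photomparams` reduces to a single `zeropoint` per exposure, that magnitudes follow `mag = zeropoint - 2.5 log10(ADU)`, and all the numerical values.

```python
import math

# Hypothetical rows; the values are made up for illustration.
source_lists = [                     # SourceLists (expID, SID, ADU)
    {"expID": 1, "SID": 10, "ADU": 1000.0},
    {"expID": 2, "SID": 7, "ADU": 100.0},
]
associate_lists = [                  # AssociateLists (ID, expID, SID)
    {"ID": "A", "expID": 1, "SID": 10},
    {"ID": "A", "expID": 2, "SID": 7},
]
photom_solutions = {                 # PhotomSolutions (expID, photomparams)
    1: {"zeropoint": 25.0},          # assumption: one zeropoint per exposure
    2: {"zeropoint": 22.5},
}


def calibrated_source_lists(sources, associations, photom):
    """Derive CalibratedSourceLists (ID, expID, mag) by joining the inputs.

    Nothing is stored: the result is recomputed from the source tables,
    so a new photometric solution changes the output on the next call.
    """
    ids = {(a["expID"], a["SID"]): a["ID"] for a in associations}
    rows = []
    for s in sources:
        zeropoint = photom[s["expID"]]["zeropoint"]
        rows.append({
            "ID": ids[(s["expID"], s["SID"])],
            "expID": s["expID"],
            "mag": zeropoint - 2.5 * math.log10(s["ADU"]),
        })
    return rows


before = calibrated_source_lists(source_lists, associate_lists, photom_solutions)

# Changing the reference system for one exposure only touches the solution
# table; the calibrated table follows when it is derived again.
photom_solutions[1] = {"zeropoint": 25.1}
after = calibrated_source_lists(source_lists, associate_lists, photom_solutions)
```

Keeping the calibrated table as a derived product rather than a stored one is one way to read "dynamical reference system": the lineage from source lists and solutions to calibrated magnitudes stays explicit, and a revised solution propagates without touching the pixel-level stage 1 output.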

(23)

Beyond Big Data

• QC and re-processing – KiDS Euclid FAIR

• OU EXT > Billion – dynamic tables

All techniques go back to the source

Scientists and journalists -> Facts and Fakes

Structured data and unstructured data

(24)
(25)


DATA VALIDATION

[Diagram: data validation pipeline. Labels: Fact checker; Pipeline; FAIR; ML; ML checker; Media scanner; structured data; unstructured data; Focus on domains; ML lineage builder; Extreme Data lineage]

ML creates links (per se) multiple links/joins

Extreme Data lineage

Import results ML lineage builder AWE database

ML Checker

New component – optional

Close the EDL – ML loop

Replace the fiddling in ML

(26)

conclusions

Next level is all about data validation

• check ML

• QC

• systematics in data sets

• OU-ext dynamic Euclidization

• unstructured data: ML + lineage

Almost all about going back to the source

Facts and Fakes
