Hybrid CPU-GPU System

Academic year: 2021
(1)
(2)

URBAN

WP1&3: Mapping on Processors of Mixed Types

(3)

Hybrid CPU-GPU System

• Hardware platform

– CPU

• Intel Core i7

• 5.06 GHz

• 8 MB Intel smart cache

• 4GB RAM

– GPU

• Nvidia GeForce GTX 280

• 1.3 GHz clock speed

• 240 CUDA cores

• 65535 threads

• 1 GB global memory

• 16 KB shared memory per core

• Memory bandwidth 147 GB/sec

• CPU-GPU bandwidth 1.4 GB/sec

• Compute capability 1.3

(4)

Main Achievements

• Design methodology + implementations

• WP1: Digital Surveying

– Task 1: Mapping of H.264 Intra-only baseline encoder
– Task 2: Mapping of corner feature detector

• WP3: 3D City Modeling for Visualization

– Task 3: Mapping of H.264 baseline encoder

(5)

Main Achievements

• WP1: Digital Surveying

– Task 1: Mapping of H.264 Intra-only baseline encoder
– Task 2: Mapping of corner feature detector

• WP3: 3D City Modeling for Visualization

– Task 3: Mapping of H.264 baseline encoder

(6)

Mapping of H.264 Encoder on Multi-Core CPU

• Initial version with algorithmic tuning

– Intra 16x16 and Intra 4x4 modes only
– Same algorithm as the one used in the quality test

• Processor-specific optimizations

– SSE optimizations; loop unrolling in CAVLC, transform, inverse transform and prediction

• Use of OpenMP for a dual-core CPU: frames partitioned into two independent slices

[Chart: per-component speed-ups (CAVLC, transform, inverse transform, intra prediction, total encoder), values between 2.45 and 6.15; measured on Intel Core 2 Duo CPU, 2.20 GHz, 2 GB RAM]

Target: 12 fps
Before optimization: 6-8 fps
Final version: 13-15 fps

(7)

Quality Measurement

• Quality assessment of the final 3D reconstruction based on the decoded video

– Test on 20,000 images: JPEG2000 @ 295 KB and H.264 @ ~350 KB
– Required accuracy: 15 cm in 3D

                             JPEG2000   H.264
feature tracks               101000     92000
reprojection error (pixels)  1.126      1.141
error in 2D (cm)             5.97       6.96
error in 3D (cm)             11.3       12

Marginal Quality Degradation

(8)

Main Achievements

• WP1: Digital Surveying

– Task 1: Mapping of H.264 Intra-only baseline encoder
– Task 2: Mapping of corner feature detector

• WP3: 3D City Modeling for Visualization

– Task 3: Mapping of H.264 baseline encoder

(9)

Selected Components for Optimization

• 3D reconstruction components

– Feature (corner) detector

– Feature matching with NCC (Normalized Cross Correlation)

[Diagram: frames I(t-1) and I(t) are each decoded and passed through feature detection, followed by feature matching between the two]

(10)

Corner Feature Detector

• Low Complexity Harris detector

– Use integral image to calculate sum of pixels/gradients
– Approximate the Gaussian filter by a box filter

GPU vs. CPU: 10x speed-up

[Chart: per-stage speed-ups for integral image, gradient, integral gradient, cornerness, NMS, total time, and total time without NMS; values include 7.1, 10, 12.42, 13, 13.9, 16, and 36]

P. Mainali, Q. Yang, G. Lafruit, R. Lauwereins and L. Van Gool, "LOCOCO: Low Complexity Corner Detector", submitted to ICASSP 2010.

(11)

Corner detector: Sample Image

(12)

Feature Matching

• Test possible feature pairs

– Between stereo views at time t: 4 image pairs
– Between each view at time t and t-1: 8 image pairs
– Use a metric to compare their 9x9 neighborhoods

• Normalized Cross Correlation (NCC)

– Spatial method

(13)

Feature matching: CUDA implementation

• Feature pairs selection

– Search scope: region of interest or global

• CUDA implementation

– Sorting NCC scores done in shared memory

GPU vs. CPU: 10-50x speed-up

(14)

Summary: Pipeline

[Diagram: pipeline. H.264 decoding on the CPU feeds 8 frames (t | t-1) to the GPU over the 1.4 GB/sec CPU-GPU link; corner detectors (70 fps/image) and NCC matching run in 8 streams against global memory (147 GB/sec); 12 feature matches / 8 feature strips are produced, and 12 feature-pair records are read back for OpenGL display (20-80 fps/image)]

(15)

Main Achievements

• WP1: Digital Surveying

– Task 1: Mapping of H.264 Intra-only baseline encoder
– Task 2: Mapping of corner feature detector

• WP3: 3D City Modeling for Visualization

– Task 3: Mapping of H.264 baseline encoder

(16)

Mapping of H.264 Baseline Encoder

• Encode video captured by omni-directional camera

• Selected components: full-search motion estimation

– All MB/sub-MB partitions
– ¼ pel accuracy

(17)

Optimization Strategies

• Algorithm parallelization

– Perform ME for entire image in parallel

• Minimize memory access bandwidth and latency

– Reduce access to global memory
– Coalesced memory access

– ...

GPU vs. CPU:

Motion estimation: 30-40x speed-up
Encoding time: 3-4x speed-up

(18)

Conclusions

• Methodology for mapping on hybrid CPU-GPU platform

– Algorithm parallelization

– Minimize communication between CPU and GPU
– Minimize memory access to off-chip memory

– Increase number of threads to hide memory latency

• No holy grail for implementation

– Select components carefully
– Trade-off + trial & error

• Follow the trend for new hardware/software

– OpenCL, 100-core CPU, ...

(19)
(20)

Hybrid CPU-GPU programming

• Comparison of a CPU and a GPU implementation

– CPU (single CPU with no hyperthreading): optimization through data locality exploration (efficient data cache)

– GPU: optimization by parallelization exploration (SIMD)

• CUDA (NVIDIA)

– Compute Unified Device Architecture

– “Extended C”

(21)

Hybrid CPU-GPU programming

• Memory Model

– Grid

• Global Memory
• Constant Memory (read only)
• Texture Memory (read only)

– Block

• Shared Memory
• Registers
• Local Memory

– SIMD parallelism

• Hundreds of cores
• 512 threads per core

(22)

CUDA vs Traditional GPGPU

• Shared memory allows for user-controlled data cache management

– Example: N x M Convolution kernels

(23)

Corner detector:

CUDA implementation

1. Integral image

– Mostly sequential algorithm but …

1. Prefix-sum parallel algorithm to compute the sum of rows
2. Transpose the result using shared memory and block pre-fetches
3. Re-run the prefix-sum on the rows

– The transpose step is needed in order to optimize the memory reads

– CPU: 18.3 ms, GPU: 1.41 ms for 1632x1248 images
– Speed-up = 13

(24)

Corner detector:

CUDA implementation

• Prefix-sum parallel implementation

– Scan: Mark Harris, Shubhabrata Sengupta, and John D. Owens. “Parallel Prefix Sum (Scan) with CUDA”. In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851–876. Addison Wesley, August 2007

[Diagram: up-sweep (reduction) and down-sweep phases of the scan]

(25)

Corner detector:

CUDA implementation

2. Approximate the Gaussian derivative filter by a box filter over the integral image and compute Gx, Gy

• Box filters can be computed easily with CUDA. Options:

1. Compute from global memory directly with memory fetches at x±4, y±4 positions

2. Pre-store the input from global memory into shared memory for optimized reads

– CPU: 45.4 ms, GPU: 1.26 ms (for two gradients)
– Speed-up = 36

(26)

Corner detector:

CUDA implementation

3. Compute the products Gxx, Gxy, Gyy
4. Compute their sums Sxx, Sxy, Syy

• Combination of the first two kernels. Options:

1. Pre-compute Gxx, Gyy, Gxy, store them in global memory and then compute integral images of the results

2. Fuse both kernels by avoiding a pre-store of the three gradient products in global memory

• But bandwidth is high…

• Extra computations make the scan implementation less efficient by using more resources per thread

• Results are similar

• CPU: 75.1 ms, GPU: 7.5 ms (three kernel launches)

• Speed-up = 10

(27)

Corner detector:

CUDA implementation

5. Evaluate the cornerness R from matrix H


• Pixel-wise evaluation of a simple expression from Sxx, Sxy, Syy

• Optimized reads by coalescing the pointers

• CPU: 17.6 ms, GPU: 1.1 ms

• Speed-up = 16

6. Non-maximum suppression

• Sequential algorithm, but the sorting step can be parallelized using radix sort

Nadathur Satish, Mark Harris, and Michael Garland. "Designing Efficient Sorting Algorithms for Manycore GPUs". Proc. 23rd IEEE International Parallel & Distributed Processing Symposium, May 2009.

– CPU: 22.7 ms, GPU: 12.8 ms
– Speed-up = 1.8

(28)

Feature matching:

CUDA implementation

• Spatial method

– Computing the mean of the windows can be done by

• Prefix sum

• Integral images value fetches

– Memory fetches are not optimized for CUDA

• Feature windows are generally not well aligned

• Storing the windows in shared memory does not bring enough data re-use


(29)

Feature matching:

CUDA implementation

• Spatial method

– Pre-center (subtract the mean) and align the feature neighborhoods in the feature detection kernel (integral image is already used)

– These feature neighborhood strips are stored in global memory for each image

– Minimal extra computation for the feature detection kernel

– The GTX 280's 1 GB of global memory makes it possible to avoid many CPU-GPU transfers


(30)

Corner detector:

CUDA implementation

• Summary

[Table: per-stage CPU vs. GPU timings]

(31)

Feature matching:

CUDA implementation

• Feature pairs selection

– For each feature fi in image A

• NCC is computed between fi and gk in image B, with k ∈ [0, K]
• Select the best correlated pair (fi, gk)
• K can be fixed or dependent (e.g. features in a region of interest)

– CUDA implementation

• Sorting NCC scores done in shared memory
