URBAN
WP1 Digital Surveying
H.264 encoding:
Motivation in URBAN – Rate Distortion
[Rate-distortion plot comparing JPEG 2000 and AVC (H.264)]
H.264 encoding:
Motivation in URBAN – Encoding Speed
[Bar chart: encoding speed (FPS, 0–12 scale); mean PSNR 41.69 dB vs. 39.18 dB]
H.264 encoding:
Motivation in URBAN – Quality Measurement
• Quality assessment of the final 3D reconstruction based on the decoded video
– Test on 20,000 images: JPEG 2000 @ 295 KB and H.264 @ ~350 KB
– Required accuracy: 15 cm in 3D

                              H.264     JPEG 2000
  error in 3D (cm)            12        11.3
  error in 2D (cm)            6.96      5.97
  reprojection error (px)     1.141     1.126
  feature tracks              92,000    101,000
H.264 encoding optimization steps
• Initial version with algorithmic tuning
– Intra 16x16 and Intra 4x4 modes only
– Same algorithm as the one used in the quality test
• Processor-specific optimizations
– SSE optimizations, loop unrolling in CAVLC, transform, inverse transform and prediction
– Use of OpenMP for a dual-core CPU: frames partitioned into two independent slices
– Measured on an Intel Core 2 Duo CPU, 2.20 GHz, 2 GB RAM
[Bar chart — speed-ups: CAVLC 2.45, transform 2.75, inverse transform 4.89, intra prediction 3.15, total encoder 6.15]
Target: 12 fps
Final version: 13–15 fps
Conclusion for Task 1
• Deliverables D113 and D114:
– H.264 Intra-only Baseline Profile
– Encodes URBAN images at ~14 fps > 12 fps (real-time)
– Acceptable quality and bitrate
  • Lower than JPEG 2000 but still acceptable
– GPU acceleration is not possible
  • Too many dependencies
  • Quality degradation using a GPU with CUDA for intra prediction
– Enables fast image decoding @ 20–25 fps
Deliverable D122 : optimized implementation of the
selected components on a CPU-GPU hybrid system
• Selected components
– Debayering
  • (actually takes place in the van, see D111, D112, D113)
– 3D reconstruction components:
  • H.264 decoding
  • Feature (corner) detector
  • Feature matching with NCC (Normalized Cross-Correlation)
[Pipeline diagram: I(t-1) and I(t) → decoding → feature detection → feature matching]
Hybrid CPU-GPU System
• Hardware used
– CPU
  • Intel Core i7
  • 5.06 GHz
  • 8 MB Intel Smart Cache
  • 4 GB RAM
– GPU
  • Nvidia GeForce GTX 280
  • 1.3 GHz clock speed
  • 240 CUDA cores
  • 65535 threads
  • 1 GB global memory
  • 16 KB shared memory per multiprocessor
  • Memory bandwidth: 147 GB/s
  • CPU-GPU bandwidth: 1.4 GB/s
1. H.264 decoding
• Hardware acceleration (video processor)
– Nvidia GeForce GTX 280 supports PureVideo® HD
– Intra-coded bitstream decoding
  • 1080p @ 25–30 fps
  • 1632x1248 @ 20 fps
2. Feature (Corner) point detector
• Choice of the algorithm
– Corner detectors
  • The state of the art is mainly based on two approaches: Harris and KLT
  • Motivation: the matching problem
2. Feature (Corner) point detector
• Low-complexity Harris detector
– Use the integral image to calculate sums of pixels/gradients
– Approximate the Gaussian filter by a box filter
• GPU vs. CPU: 10x speed-up
[Bar chart — per-stage speed-ups (GPU vs. CPU): Integral Image 13, Gradient 36, Integral Gradient 10, Cornerness 16; NMS / total time / total time without NMS: 7.1, 13.9, 12.42]
3. Feature matching
• Test possible feature pairs
– Between stereo views at time t: 4 image pairs
– Between each view at times t and t-1: 8 image pairs
– Use a metric to compare their 9x9 neighborhoods
• Normalized Cross-Correlation (NCC)
– Spatial method
3. Feature matching
• CUDA implementation
– Spatial method:
  • Computationally expensive on the CPU
  • Can be run efficiently in parallel on the GPU as a convolution
  • NCC can be estimated for many window pairs in parallel
– Spectral method:
  • An approximation, but lower complexity on the CPU
  • The cuFFT library provides optimized forward and inverse FFTs
  • However, it is optimized for large window sizes (here we have 9x9)
  • FFTs of several windows cannot be launched in parallel on the GPU
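As an illustration of the spatial method, here is a minimal NumPy sketch of NCC between one pair of 9x9 windows (the function name and structure are ours, not taken from the deliverable's code):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equal-size windows.

    Spatial method: subtract each window's mean, then correlate and
    normalize by the windows' energies.
    """
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0.0:            # flat window: correlation undefined
        return 0.0
    return float((a * b).sum() / denom)

# Identical windows correlate perfectly; inverted ones anti-correlate.
w = np.arange(81, dtype=np.float64).reshape(9, 9)
print(round(ncc(w, w), 6))      # 1.0
print(round(ncc(w, -w), 6))     # -1.0
```

On the GPU, this per-pair computation is what gets replicated across many window pairs in parallel.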
Feature matching:
CUDA implementation
• Feature pairs selection
[Illustrations: candidate features taken from a region of interest vs. globally]
• CUDA implementation
Summary: Pipeline
[Pipeline diagram: H.264 decoding of 8 frames on the GPU (20 fps/image, OpenGL readback); corner detector on 8 feature strips (70 fps/image); NCC matching between t and t-1 over 8 streams (80 fps/image, 12 feature matches); GPU global memory bandwidth 147 GB/s; CPU-GPU bandwidth 1.4 GB/s; readback of the 12 feature-pair results to the CPU]
Conclusion for Task 2
• Deliverables D122 and D123
– CPU-GPU programming allows for significant acceleration, but algorithmic tuning is mandatory
– Bottleneck: CPU-GPU communication
  • The amount of data transferred between the CPU and GPU depends on the application
    – CPU → GPU: optional information on regions of interest for feature matches
    – GPU → CPU: optional readback of decoded images to the CPU

                       CPU (fps)   GPU (fps)
  H.264 decoding       5.5         20
  Feature detection    5.6         70
  Feature matching     2.5         80
H.264 encoding optimization steps
• Initial version with algorithmic tuning
– Intra 16x16 and Intra 4x4 modes only
– Same algorithm as the one used in the quality test
– Measured on an Intel Core 2 Duo CPU, 2.20 GHz, 2 GB RAM
4.29 fps
4.84 fps
H.264 encoding optimization steps
• DTSE, vectorization and multi-threading on several cores
– SSE optimizations, loop unrolling in CAVLC, transform, inverse transform and prediction
– Use of OpenMP for a dual-core CPU: frames partitioned into two independent slices
– Measured on an Intel Core 2 Duo CPU, 2.20 GHz, 2 GB RAM
13.7 fps
15.3 fps
2. Feature (Corner) point detector
• Harris detector
1. Compute the x and y derivatives of the image filtered by a Gaussian: Gx, Gy
2. Compute the products of derivatives: Gxx, Gxy, Gyy
3. Compute weighted averages of these products: Sxx, Sxy, Syy
4. Compute the matrix H = [Sxx, Sxy; Sxy, Syy] and estimate the cornerness R = det(H) - k (trace(H))^2

• Low-complexity Harris detector
1. Approximate the Gaussian derivative filter by the box filter of the integral image and compute G'x, G'y
2. Compute the products of derivatives: G'xx, G'xy, G'yy
3. Compute the sums of these products: S'xx, S'xy, S'yy
4. Compute the matrix H = [S'xx, S'xy; S'xy, S'yy] and estimate the cornerness R = det(H) - k (trace(H))^2

P. Mainali, Q. Yang, G. Lafruit, R. Lauwereins and L. Van Gool, "LOCOCO: Low Complexity Corner Detector", submitted to ICASSP 2010.
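The four steps above can be sketched end-to-end in NumPy. This is an illustrative approximation with our own helper names: central differences and a plain box window stand in for the box-filtered derivatives and sums of the actual detector.

```python
import numpy as np

def harris_cornerness(img, k=0.04, win=3):
    """Cornerness map following the four steps, with the Gaussian
    weighting approximated by a box filter (the low-complexity idea)."""
    img = img.astype(np.float64)
    # 1. Image gradients (central differences stand in for G'x, G'y).
    gy, gx = np.gradient(img)
    # 2. Products of derivatives.
    gxx, gxy, gyy = gx * gx, gx * gy, gy * gy
    # 3. Box-filtered sums S'xx, S'xy, S'yy via an integral image.
    def box_sum(a):
        ii = np.pad(a.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
        h, w = a.shape
        out = np.zeros_like(a)
        for y in range(h):
            for x in range(w):
                y0, y1 = max(y - win, 0), min(y + win + 1, h)
                x0, x1 = max(x - win, 0), min(x + win + 1, w)
                out[y, x] = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
        return out
    sxx, sxy, syy = box_sum(gxx), box_sum(gxy), box_sum(gyy)
    # 4. R = det(H) - k * trace(H)^2 for H = [Sxx Sxy; Sxy Syy].
    return (sxx * syy - sxy * sxy) - k * (sxx + syy) ** 2

# A white square on a black background: the strongest responses
# appear near its four corners.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_cornerness(img)
y, x = np.unravel_index(R.argmax(), R.shape)
print((y, x))
```

Along straight edges one eigenvalue of H vanishes, so det(H) is small and R goes negative; only genuine corners produce large positive R.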
Corner detector:
CUDA implementation
• Speed-up: GPU vs. CPU
[Bar chart — per-stage speed-ups: Integral Image 13, Gradient 36, Integral Gradient 10, Cornerness 16; NMS / total time / total time without NMS: 7.1, 13.9, 12.42]
Hybrid CPU-GPU programming
• Comparison of a CPU and a GPU implementation
– CPU (single CPU, no hyper-threading): optimization through data-locality exploration (efficient data cache)
– GPU: optimization through parallelization exploration (SIMD)
• CUDA (NVIDIA)
– Compute Unified Device Architecture
Hybrid CPU-GPU programming
• Memory Model
– Grid
  • Global memory
  • Constant memory (read-only)
  • Texture memory (read-only)
– Block
  • Shared memory
  • Registers
  • Local memory
– SIMD parallelism
  • Hundreds of cores
  • 512 threads per block
CUDA vs Traditional GPGPU
• Shared memory allows for user-controlled data-cache management
Corner detector:
CUDA implementation
1. Integral image
– A mostly sequential algorithm, but:
  1. Use a prefix-sum parallel algorithm to compute the sum of each row
  2. Transpose the result using shared memory and block pre-fetches
  3. Re-apply the prefix sum to the rows
– The transpose step is needed to optimize (coalesce) the memory reads
– CPU: 18.3 ms, GPU: 1.41 ms for 1632x1248 images
– Speed-up = 13
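The row-scan / transpose / row-scan pattern above can be sketched in NumPy (cumsum stands in for the parallel prefix-sum kernel; the explicit copy mirrors the transpose kernel):

```python
import numpy as np

def integral_image(img):
    """Integral image built the way the GPU version does it:
    prefix-sum the rows, transpose, prefix-sum the rows again."""
    s = np.cumsum(img, axis=1)   # 1. prefix sum along each row
    s = s.T.copy()               # 2. transpose (coalesced reads on GPU)
    s = np.cumsum(s, axis=1)     # 3. prefix sum along the rows again
    return s.T                   # back to the original orientation

img = np.ones((4, 5))
ii = integral_image(img)
print(ii[-1, -1])                # 20.0 — sum of all pixels
```

Each entry ii[y, x] holds the sum of all pixels above and to the left of (y, x), which later lets any box sum be fetched with four reads.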
Corner detector:
CUDA implementation
• Prefix-sum parallel implementation
– Scan: Mark Harris, Shubhabrata Sengupta, and John D. Owens, "Parallel Prefix Sum (Scan) with CUDA", in Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851–876. Addison-Wesley, August 2007.
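The cited scan follows the work-efficient Blelloch pattern: an up-sweep (reduction) over a balanced tree, then a down-sweep that distributes partial sums. Written out sequentially in Python for clarity (the CUDA kernel runs each inner loop in parallel; length restricted to a power of two, as in the basic version of the chapter):

```python
def exclusive_scan(x):
    """Work-efficient exclusive prefix sum (Blelloch scan),
    sequential sketch of the pattern the GPU parallelizes."""
    a = list(x)
    n = len(a)
    # Up-sweep: build partial sums up the tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2
    # Down-sweep: clear the root, then push sums back down.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += t
        d //= 2
    return a

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```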
Corner detector:
CUDA implementation
2. Approximate the Gaussian derivative filter by the box filter of the integral image and compute Gx, Gy
• Box filters can be computed easily with CUDA. Options:
  1. Compute directly from global memory, with memory fetches at the x±4, y±4 positions
  2. Pre-store the input from global memory into shared memory for optimized reads
– CPU: 45.4 ms, GPU: 1.26 ms (for the two gradients)
– Speed-up = 36
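A box-filtered derivative needs only a handful of integral-image fetches per pixel, which is what makes option 1 cheap. A sketch for the x-gradient (our own helper name; the ±4 fetch radius matches option 1 above):

```python
import numpy as np

def box_gradient_x(ii, x, y, r=4):
    """Approximate d/dx at (x, y) as the difference of two box sums
    read from a zero-padded integral image ii: the box to the right
    of the pixel minus the box to its left."""
    def box(x0, y0, x1, y1):    # sum over rows [y0, y1), cols [x0, x1)
        return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    right = box(x + 1, y - r, x + r + 1, y + r + 1)
    left = box(x - r, y - r, x, y + r + 1)
    return right - left

# Vertical step edge: the gradient responds strongly at the boundary.
img = np.zeros((16, 16))
img[:, 8:] = 1.0
ii = np.pad(img.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
print(box_gradient_x(ii, 8, 8))   # 36.0
```

Four global-memory reads per box make this a good fit for the GPU; the shared-memory variant (option 2) pre-stages the same values for the whole thread block.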
Corner detector:
CUDA implementation
3. Compute the products Gxx, Gxy, Gyy
4. Compute their sums Sxx, Sxy, Syy
• A combination of the first two kernels. Options:
  1. Pre-compute Gxx, Gyy, Gxy, store them in global memory, then compute integral images of the results
  2. Fuse both kernels, avoiding a pre-store of the three gradient products in global memory
• But bandwidth is high…
• The extra computation lowers the optimality of the scan implementation by using more resources per thread
• Results are similar for both options
• CPU: 75.1 ms, GPU: 7.5 ms (three kernel launches)
• Speed-up = 10
Corner detector:
CUDA implementation
5. Evaluate the cornerness R from the matrix H
• Pixel-wise evaluation of a simple expression of Sxx, Sxy, Syy
• Reads optimized by coalescing the pointers
• CPU: 17.6 ms, GPU: 1.1 ms
• Speed-up = 16
6. Non-maximum suppression
• A sequential algorithm, but the sorting step can be run in parallel using a radix sort
  • Nadathur Satish, Mark Harris, and Michael Garland, "Designing Efficient Sorting Algorithms for Manycore GPUs", Proc. 23rd IEEE International Parallel & Distributed Processing Symposium, May 2009.
– CPU: 22.7 ms, GPU: 12.8 ms
– Speed-up = 1.8
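The sort-then-suppress idea can be sketched as a greedy pass: visit candidates in order of decreasing response (the ordering the GPU gets from the parallel radix sort) and keep each one that is not too close to an already kept corner. A minimal illustration with our own names and a simple Chebyshev-distance criterion:

```python
def nms(corners, radius=5):
    """Greedy non-maximum suppression over (score, y, x) candidates,
    visited in order of decreasing score."""
    kept = []
    for score, y, x in sorted(corners, reverse=True):
        if all(abs(y - ky) > radius or abs(x - kx) > radius
               for _, ky, kx in kept):
            kept.append((score, y, x))
    return kept

cands = [(9.0, 10, 10), (8.0, 12, 11), (7.0, 40, 40), (6.5, 41, 42)]
print(nms(cands))   # keeps (9.0, 10, 10) and (7.0, 40, 40)
```

The suppression pass itself stays mostly sequential, which is why this stage shows the smallest GPU speed-up of the pipeline.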
Feature matching:
CUDA implementation
• Spatial method
– Computing the mean of the windows can be done by:
  • Prefix sums
  • Integral-image value fetches
– Memory fetches are not optimized for CUDA:
  • Feature windows are generally not well aligned
  • Storing the windows in shared memory does not bring enough data re-use
Feature matching:
CUDA implementation
• Spatial method
– Pre-center (subtract the mean) and align the feature neighborhoods in the feature-detection kernel (the integral image is already available there)
– These feature-neighborhood strips are stored in global memory for each image
– Minimal extra computation for the feature-detection kernel
– The GTX 280's 1 GB of global memory makes it possible to avoid many CPU-GPU transfers
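The payoff of pre-centering is that NCC between two already zero-mean strips collapses to a normalized dot product, with no per-pair mean computation left. A sketch of the two halves (helper names are ours):

```python
import numpy as np

def precenter(window):
    """Done once in the feature-detection kernel: subtract the window
    mean (cheap via the integral image) and flatten the 9x9
    neighborhood into a strip for global memory."""
    w = window.astype(np.float64).ravel()
    return w - w.mean()

def ncc_precentered(a, b):
    """With both strips already zero-mean, NCC reduces to a
    normalized dot product."""
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom else 0.0

rng = np.random.default_rng(0)
w1 = rng.random((9, 9))
w2 = w1 + 0.05 * rng.random((9, 9))   # slightly perturbed copy of w1
a, b = precenter(w1), precenter(w2)
print(round(ncc_precentered(a, a), 6))   # 1.0
```

The matching kernel then only ever touches the aligned strips, which is what makes the many-pairs-in-parallel NCC cheap.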
Corner detector:
CUDA implementation
• Summary
[Summary table: per-stage CPU vs. GPU timings]
Feature matching:
CUDA implementation
• Feature pairs selection
– For each feature fi in image A:
  • NCC is computed between fi and gk in image B, with k ∈ [0, K]
  • Select the best-correlated pair (fi, gk)
  • K can be fixed or adaptive (e.g., restricted to features in a region of interest)
– CUDA implementation
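The selection rule above can be sketched as a best-match loop over pre-centered strips (a sequential illustration with our own names; the CUDA version evaluates the candidate scores for all features in parallel):

```python
import numpy as np

def match_features(strips_a, strips_b, candidates, threshold=0.8):
    """For each feature i in image A, evaluate NCC against its
    candidate features in image B and keep the best-correlated pair.
    Strips are assumed pre-centered (zero mean), so NCC is a
    normalized dot product; candidates[i] lists the indices k to test
    (a fixed K, or the features inside a region of interest)."""
    matches = []
    for i, a in enumerate(strips_a):
        na = np.sqrt(a @ a)
        best_k, best_score = -1, threshold
        for k in candidates[i]:
            b = strips_b[k]
            score = (a @ b) / (na * np.sqrt(b @ b))
            if score > best_score:
                best_k, best_score = k, score
        if best_k >= 0:
            matches.append((i, best_k, best_score))
    return matches

# Tiny zero-mean strips: candidate 0 is anti-correlated with the
# query, candidate 1 is identical.
strips_a = [np.array([1.0, -1.0, 2.0, -2.0])]
strips_b = [np.array([-1.0, 1.0, -2.0, 2.0]),
            np.array([1.0, -1.0, 2.0, -2.0])]
print(match_features(strips_a, strips_b, {0: [0, 1]})[0][:2])   # (0, 1)
```

The threshold rejects pairs with no well-correlated candidate, so features without a counterpart simply produce no match.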