Try GPU computing with WebCL

GPU computing is a hot topic at the moment. It means using a GPU (graphics processing unit) as a co-processor to accelerate CPUs for general-purpose scientific and engineering computing. General-purpose computing on graphics processing units (GPGPU) uses a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The GPU accelerates applications running on the CPU by offloading some of the compute-intensive and time-consuming portions of the code. From a user’s perspective, the application runs faster because it uses the massively parallel processing power of the GPU to boost performance. This is known as “heterogeneous” or “hybrid” computing. It is another shift towards multiple cores: CPU and GPU fusion.

For several years there have been manufacturer-specific APIs for using the GPU for computing. Nvidia has CUDA, ATI/AMD has Stream and Microsoft has several technologies (DirectCompute, Accelerator, C++ AMP).

OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL has a very strong support base, and is likely the platform of choice for developers on Intel, AMD, ARM, Apple, IBM, etc.

The WebCL working group is working to define a JavaScript binding to the Khronos OpenCL standard for heterogeneous parallel computing. WebCL will enable web applications to harness GPU and multi-core CPU parallel processing from within a Web browser, enabling significant acceleration of applications such as image and video processing and advanced physics for WebGL games.

Nokia and Samsung have implemented open source prototypes of WebCL. Nokia has a WebCL prototype for the Firefox web browser (it works on 32-bit Firefox with 32-bit OpenCL drivers), and Samsung has an open source WebCL prototype for WebKit (designed to work with the Safari browser on Mac OS X and an Nvidia graphics card).

The Nokia WebCL prototype for Firefox is an interesting project that you can easily test in your web browser by just installing a Firefox extension (you need a graphics card that supports OpenCL). The Nokia WebCL web page has a web-based interactive photo editor demo that uses the GPU for image processing, other demos, a WebCL programming tutorial and an interactive WebCL kernel toy that lets you easily test your own OpenCL/WebCL code. For an introduction to Nokia WebCL, watch the Nokia WebCL prototype video.


  1. Tomi Engdahl says:

    CPU and GPU chips account for half of $111bn chip market
    Application specific cores are the next big thing

    ANALYST OUTFIT IMS Research claims that half of the processor market is made up of chips that have CPUs and GPUs integrated on the same die.

    The firm’s findings show that chips that include both CPU and GPU cores now account for half of the $111bn processor market. According to its report, this growth has all but destroyed the integrated graphics market, but the company said that discrete GPUs will continue to see growth.

    Tom Hackenberg, semiconductors research manager at IMS Research, said, “Through the last decade the mobile and media consumption device markets have been pivotal for this hybridization trend; Apple, Broadcom, Marvell, Mediatek, Nvidia, Qualcomm, Samsung, ST-Ericsson, Texas Instruments and many other processor vendors have been offering heterogeneous application-specific processors with a microprocessor core integrating a GPU to add value within extremely confined parameters of space, power and cost.”

    Hackenberg made an interesting point as to why both AMD and Intel are pushing deeper into their respective CPU and GPU on-die strategies, suggesting that it is a way to easily design an embedded processor for use in handheld devices.

  2. Tomi Engdahl says:

    Rivals AMD and ARM unite, summon others to become ‘heterogeneous’

    HSA Foundation — a non-profit organization dedicated to promoting the dark arts of Heterogeneous System Architecture.

    It’s a relatively old concept in computing, but the Foundation’s founding partners (AMD, ARM, Imagination Technologies, MediaTek and Texas Instruments) all stand to gain from its wider adoption. How come? Because it involves boosting a chip’s performance by making it use its various components as co-processors, rather than treating them as specialized units that can never help each other out.

    In other words, while Intel pursues Moore’s Law and packs ever-more sophisticated transistors into its CPUs, AMD, ARM and the other HSA pals want to achieve similar or better results through parallel computing.

    In most cases, that’ll mean using the graphics processor on a chip not only for visuals and gaming, but also for general tasks and apps. This can already be achieved using a programming language called OpenCL, but AMD believes it’s too tricky to code and is putting mainstream developers off. Equally, NVIDIA has long had its own language for the same purpose, called CUDA, but it’s proprietary.

    Whatever niche is left in the middle, the HSA Foundation hopes to fill it with an easier and more open standard that is not only cross-OS but also transcends the PC / mobile divide.

  3. Tomi Engdahl says:

    PowerVR To Make Mobile Graphics, GPU Compute a Three-Way Race Again

    For over 10 years, the desktop and mobile graphics space has been dominated by two players: Nvidia and AMD/ATI. After 3dfx collapsed,

    Now, there’s a flurry of evidence to suggest that Imagination Technologies plans to re-enter the PC market.

    There’s a new ray-tracing SDK out and a post discussing how PowerVR is utilizing GPU Compute and OpenCL to offload and accelerate CPU-centric tasks.

  4. Tomi says:

    Rootbeer GPU Compiler Lets Almost Any Java Code Run On the GPU

    “Today the source code to the Rootbeer GPU Compiler was released as open source on github. This work allows for a developer to use almost any Java code on the GPU. It is free, open source and highly tested.”

    Rootbeer GPU Compiler on github

    Downside of Java on GPU:

    Now, instead of having to bundle your blessed version of java and libraries with the app, you have to bundle those, AND a graphics card that speaks CUDA and has a specific driver version.

    That whole “write once, run anywhere” was never true, and not only because it’s stretching the word “run”. Now it might get even less true.

    Used intelligently by a skilled programmer, Java can deliver great results and provide exactly the sort of cross platform capabilities it was designed for. If you write only to the public APIs then Java truly is Write-Once-Run-Anywhere.

    Used by idiots and/or kids who just earned that undergrad CS degree, it tends to provide less. Buggy software written by developers who have bad single-platform habits usually runs nicely in only one environment. Some bad Java developers use internal functionality that can change between Java versions (these folks are used to coding to the Undocumented APIs of Win32 that you used to have to use to get things done). In Java you shouldn’t do this.

  5. Tomi Engdahl says:

    OpenCL Programming: Parallel Processing Made Faster and Easier than Ever

    Newer processors may not only have multiple cores of the same architecture, they may also integrate heterogeneous computing elements. Programming such devices with a single code base has just gotten easier.

    OpenCL for Unified, Portable Source Code

    OpenCL, the first open and royalty-free programming standard for general-purpose parallel computations on heterogeneous systems, is quickly growing in popularity as a means for developers to preserve their expensive source code investments and easily target multicore CPUs and GPUs.

    Ultimately, OpenCL enables developers to focus on applications, not just chip architectures, via a single, portable source code base. When using OpenCL, developers can use a unified tool chain and language to target all of the parallel processors currently in use. This is done by presenting the developer with an abstract platform model that conceptualizes all of these architectures in a similar way, as well as an execution model supporting data and task parallelism across heterogeneous architectures.

  6. Tomi Engdahl says:

    Developing Embedded Hybrid Code Using OpenCL

    Open Computing Language (OpenCL) is a specification and a programming framework for managing heterogeneous computing cores such as CPUs and graphics processing units (GPUs) to accelerate computationally intensive algorithms.

    Execution within an OpenCL program occurs in two places: kernels that execute on devices—most commonly GPUs—and a host program that executes on a host device—most commonly a CPU.
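
    The host/kernel split described above can be illustrated with a small sketch. It is plain Python, not real OpenCL: the vector_add_kernel function and the enqueue_nd_range helper are hypothetical names invented here, and the loop only stands in for the work-items that an actual device would run concurrently.

```python
# Emulation of the OpenCL execution model in plain Python (no GPU involved).
# All names here are illustrative assumptions, not part of any OpenCL API.

def vector_add_kernel(global_id, a, b, out):
    """Kernel body: one work-item handles exactly one element."""
    out[global_id] = a[global_id] + b[global_id]


def enqueue_nd_range(kernel, global_size, *args):
    """Host-side helper: launch one kernel instance per work-item.

    A real device would run these instances in parallel; here they run
    one after another, which gives the same result for independent items.
    """
    for global_id in range(global_size):
        kernel(global_id, *args)


def host_program():
    """Host code: prepare buffers, launch the kernel, read back the result."""
    a = [1.0, 2.0, 3.0, 4.0]
    b = [10.0, 20.0, 30.0, 40.0]
    out = [0.0] * len(a)
    enqueue_nd_range(vector_add_kernel, len(a), a, b, out)
    return out


if __name__ == "__main__":
    print(host_program())
```

    The kernel only ever sees its own index, which is what lets a device spread the work-items across many cores without coordination.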

    Using tools like OpenCL is great for photo/video editing applications, AI systems, modeling frameworks, game physics, Hollywood rendering and augmented reality, to name a few. However, there is also an embedded profile for OpenCL defined in the specification that consists of a subset of the full OpenCL specification, targeted at embedded mobile devices. Here are some highlights of what the OpenCL Embedded Profile contains:

    • 64-bit integers are optional

    • Support for 3D images is optional

    • Relaxation of rounding rules for floating point calculations

    • Precision of conversions on an embedded device is clarified

    • Built-in atomic functions are optional

    Looking forward, the OpenCL roadmap contains a number of initiatives to take it to the next level of relevance.

  7. Tomi Engdahl says:

    Parallel Computing with AMD Fusion-Based Computer-on-Modules

    The integration of powerful graphics processors on the same die with multicore x86 processors is opening new areas of compute-intensive embedded applications.

  8. Tomi Engdahl says:

    Java Project Proposal: GPU support

    In accordance with the OpenJDK guidelines [1], we would like to start
    the discussion of a new project to explore implementing GPU support in
    Java with a native JVM. This project intends to enable Java
    applications to seamlessly take advantage of a GPU–whether it is a
    discrete device or integrated with a CPU–with the objective to
    improve the application performance.

    This project will demonstrate the performance advantages of offloading
    Java compute to a GPU. We propose to use the Hotspot JVM

  9. Over 30 years of DSP « Tomi Engdahl’s ePanorama blog says:

    [...] At the same time, traditional DSP devices have seen competition from an array of alternative signal-processing platforms: CPU with DSP-oriented features; digital signal controller (a DSP core with an MCU); FPGA; and massively parallel processing using graphics processor. [...]

  10. Tomi Engdahl says:

    Nvidia puts Tesla K20 GPU coprocessor through its paces
    Early results on Hyper-Q, Dynamic Parallelism speedup

    Back in May, when Nvidia divulged some of the new features of its top-end GK110 graphics chip, the company said that two new features of the GPU, Hyper-Q and Dynamic Parallelism, would help GPUs run more efficiently and without the CPU butting in all the time. Nvidia is now dribbling out some benchmark test results ahead of the GK110’s shipment later this year inside the Tesla K20 GPU coprocessor for servers.

    Nvidia has been vague about absolute performance, but El Reg expects the GK110 to deliver just under 2 teraflops of raw DP floating point performance at 1GHz clock speeds on the cores and maybe 3.5 teraflops at single precision.

  11. Tomi Engdahl says:

    NVIDIA Floats Workstations Graphics to the Cloud

    Though enterprise IT systems are on a direct path to virtualization and the cloud, engineering systems like CAD and PLM haven’t hit a similar stride, given the lingering and very real concerns about how rich, graphics-intensive 3D tools would perform in this new environment.

    Enter the GPU giant NVIDIA, which has a plan to bring workstation-level graphics capabilities to any screen, even mobile ones. The new cloud-based NVIDIA VGX K2 GPU, built on its previously announced Kepler architecture, fills in some of the pieces needed to achieve the promise of NVIDIA’s VGX platform announced in May. That platform, deployed in the datacenter, helps IT departments create a virtualized desktop with all the graphics and GPU computing horsepower of traditional PCs or high-end workstations. Employees then can access this desktop in the field or at home from the connected device of their choice (yes, that means laptops, tablets, smartphones, or thin clients).

    Before NVIDIA VGX, company officials say, common engineering tools like 3D CAD and simulation programs were deemed too graphics-intensive and too demanding to run efficiently and effectively on a virtualized desktop.

    Some experts likened NVIDIA’s approach to that of Citrix, which is well known in IT circles for bringing CPU virtualization capabilities to the desktop, so people can use their device of choice. Similarly, NVIDIA said its VGX platform brings the power of the Kepler GPU to the same lineup of mobile devices.

  12. Tomi Engdahl says:

    Cluster compo newcomers slap down winning hand: 6 NVIDIA Tesla cards

    SC12 Student Cluster Competition (SCC)

    Like all of the teams we’ve previously seen from China, they’re big believers in hybrid HPC, and their weapon of choice is six NVIDIA Tesla 2090 cards. They’re power-hungry at 225 watts each, but they can crank out as much as 10x the performance of a traditional CPU – a bargain from a power consumption standpoint.

  13. Tomi Engdahl says:

    New 25-GPU Monster Devours Strong Passwords In Minutes

    “A presentation at the Passwords^12 Conference in Oslo, Norway, has moved the goalposts on password cracking yet again. Speaking on Monday, researcher Jeremi Gosney (a.k.a. epixoip) demonstrated a rig that leveraged the Open Computing Language (OpenCL) framework and a technology known as Virtual OpenCL (VCL) to run the HashCat password cracking program across a cluster of five 4U servers equipped with 25 AMD Radeon GPUs communicating at 10 Gbps and 20 Gbps over Infiniband switched fabric. Gosney’s system elevates password cracking to the next level, and effectively renders even the strongest passwords protected with weaker encryption algorithms, like Microsoft’s LM and NTLM, obsolete.”

    “Gosney’s cluster cranks out more than 77 million brute force attempts per second against MD5crypt.”

  14. Tomi Engdahl says:

    25-GPU cluster cracks every standard Windows password in <6 hours
    All your passwords are belong to us.

    A password-cracking expert has unveiled a computer cluster that can cycle through as many as 350 billion guesses per second. It's an almost unprecedented speed that can try every possible Windows passcode in the typical enterprise in less than six hours.

    The five-server system uses a relatively new package of virtualization software that harnesses the power of 25 AMD Radeon graphics cards.

    It achieves the 350 billion-guess-per-second speed when cracking password hashes generated by the NTLM cryptographic algorithm that Microsoft has included in every version of Windows since Server 2003.
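
    A quick calculation shows how a figure like "less than six hours" can follow from 350 billion guesses per second. The password policy used below (exactly eight characters drawn from the 95 printable ASCII characters) is an assumption made for illustration; the excerpt does not state it.

```python
# Back-of-the-envelope brute-force time at the quoted NTLM guess rate.
# ASSUMPTION: passwords are exactly 8 characters from the 95 printable
# ASCII characters; the excerpt above does not spell out the policy.

GUESSES_PER_SECOND = 350e9   # rate quoted in the article for NTLM
ALPHABET_SIZE = 95           # assumed: printable ASCII
PASSWORD_LENGTH = 8          # assumed length


def exhaustive_search_hours(alphabet_size, length, guesses_per_second):
    """Hours needed to try every password of the given length."""
    keyspace = alphabet_size ** length
    return keyspace / guesses_per_second / 3600


if __name__ == "__main__":
    hours = exhaustive_search_hours(
        ALPHABET_SIZE, PASSWORD_LENGTH, GUESSES_PER_SECOND
    )
    print(f"{hours:.1f} hours")
```

    Under these assumptions the full search takes a little over five hours. Each extra character multiplies the time by 95, which is why length matters more than raw hardware speed.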

    The Linux-based GPU cluster runs the Virtual OpenCL cluster platform, which allows the graphics cards to function as if they were running on a single desktop computer.

    The advent of GPU computing over the past decade has contributed to huge boosts in offline password cracking. But until now, limitations imposed by computer motherboards, BIOS systems, and ultimately software drivers limited the number of graphics cards running on a single computer to eight.

    "Before VCL people were trying lots of different things to varying degrees of success," Gosney said. "VCL put an end to all of this, because now we have a generic solution that works right out of the box, and handles all of that complexity for you automatically. It's also really easy to manage because all of your compute nodes only have to have VCL installed, nothing else. You only have your software installed on the cluster controller."

    The precedent set by the new cluster means it's more important than ever for engineers to design password storage systems that use hash functions specifically suited to the job. Unlike MD5, SHA1, SHA2, the recently announced SHA3, and a variety of other "fast" algorithms, functions such as Bcrypt, PBKDF2, and SHA512crypt are designed to expend considerably more time and computing resources to convert plaintext input into cryptographic hashes. As a result, the new cluster, even with its four-fold increase in speed, can make only 71,000 guesses per second against Bcrypt and 364,000 guesses per second against SHA512crypt.
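
    As a rough illustration of the advice above, Python's standard hashlib.pbkdf2_hmac can be used to store passwords with a deliberately slow function. The iteration count, salt length and helper names below are assumptions chosen for this sketch, not values from the article.

```python
import hashlib
import hmac
import os

# ASSUMPTION: illustrative parameters; tune the iteration count for real use.
ITERATIONS = 200_000
SALT_BYTES = 16


def hash_password(password, salt=None):
    """Return (salt, derived_key) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(SALT_BYTES)  # fresh random salt per password
    key = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, key


def verify_password(password, salt, expected_key):
    """Recompute the key with the stored salt and compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)


if __name__ == "__main__":
    salt, key = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, key))
    print(verify_password("wrong guess", salt, key))
```

    Every guess an attacker makes has to repeat all the iterations, so raising the count slows offline cracking in direct proportion.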

  15. Tomi Engdahl says:

    Leveraging the GPU to accelerate the Linux kernel

    Powerful graphics cards are pretty affordable these days.

    The KGPU package enlists your graphics card to help the kernel do some heavy lifting.

    This won’t work for just any GPU. The technique uses CUDA, which is a parallel computing package for NVIDIA hardware.

  16. Tomi says:

    Fall Fury: Part 2 – Shaders

    In simple terms, shaders are small programs that are executed on the Graphics Processing Unit (GPU) instead of the Central Processing Unit (CPU). In recent years, we’ve seen a major spike in the capabilities of graphic devices, allowing hardware manufacturers to design an execution layer tied to the GPU, therefore being able to target device-specific manipulations to a highly optimized unit. Shaders are not used for simple calculations, but rather for image processing. For example, a shader can be used to adjust image lighting or colors.

    Modern GPUs give access to the rendering pipeline that allows developers to execute arbitrary image processing code.

    Vertex shaders – these translate the coordinates of a vertex in 3D space in relation to the 2D frame. Vertex shaders are executed one time per vertex passed to the GPU.

    Pixel shaders – these programs are executed on the GPU for every pixel passed to them. If you want specific pixels adjusted for lighting or 3D bump mapping, a pixel shader can provide the desired effect for a surface.

    Geometry shaders – these shaders are the next progression from vertex shaders, introduced with DirectX 10. The developer is able to pass specific primitives as input and either have the output represent the modified version of what was passed to the program or have new primitives, such as triangles, be generated as a result. Geometry shaders are always executed after vertex processing in the rendering pipeline.
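
    The per-pixel idea can be sketched on the CPU. The code below is plain Python, not HLSL or GLSL; brighten_shader and run_pixel_shader are hypothetical names, and the nested loop only stands in for the GPU running the function once for every pixel.

```python
# CPU emulation of a pixel shader: one small function applied to each pixel.
# Names and the brightness factor are illustrative assumptions.

def brighten_shader(pixel, factor=1.5):
    """Scale an (r, g, b) pixel's channels, clamping each to 0..255."""
    return tuple(min(255, int(channel * factor)) for channel in pixel)


def run_pixel_shader(shader, image):
    """Apply the shader independently to every pixel of a 2D image."""
    return [[shader(pixel) for pixel in row] for row in image]


if __name__ == "__main__":
    image = [
        [(100, 50, 0), (200, 200, 200)],
        [(10, 20, 30), (255, 0, 128)],
    ]
    for row in run_pixel_shader(brighten_shader, image):
        print(row)
```

    Because the shader reads only its own pixel, no pixel depends on another, and that independence is what allows a GPU to process them all in parallel.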

  17. Tomi Engdahl says:

    OpenCL drivers discovered on Nexus 4 and Nexus 10 devices

    Companies such as AMD, Intel and Nvidia have been shipping OpenCL drivers on the desktop for some time now. On the mobile side, vendors such as ARM, Imagination, Qualcomm, Samsung and TI have been promising OpenCL on mobile and often show off demos using OpenCL. Drivers from vendors such as ARM, Qualcomm and Imagination have also passed official conformance tests, certifying that they do have working drivers in at least development firmwares.

    However, none of the vendors have publicly announced whether or not they are already shipping OpenCL in stock firmware on any device.

    However, recently we have seen several stories that OpenCL drivers are in fact present on both Nexus 4 and Nexus 10 stock firmware.

  18. Tomi Engdahl says:

    Using OpenCL for Network Acceleration

    Investigating the practicality of using OpenCL to accelerate AES and DES encryption and decryption by leveraging the GPU engines of the APU reveals a realm of possibilities for exploiting parallelism on hybrid processors.

    Microprocessor designs are trending in favor of a higher number of cores per socket versus increased clock speed. Increasingly, more cores are being integrated on the same die to fully take advantage of high-speed interconnects for interprocessor communications. Companies like Advanced Micro Devices (AMD) are innovating high-performance computing by integrating graphics with x86 CPUs to create what AMD refers to as Accelerated Processing Units (APUs).

    The advent of the APU creates opportunities for designers to develop solutions not possible a few years ago. These solutions utilize multiple languages and execute across hardware execution domains to enable a wide variety of new applications. One such application is the use of GPU resources as a massively parallel “off-load” engine for computationally intense algorithms in security and networking.

    While the results for each of the algorithms differed slightly on both the CPU and GPU, the overall trend for each was very consistent. It is clear that when the network traffic load was relatively light, meaning that there were not many concurrent threads required to support the algorithm, the CPUs were more than adequate and in fact more efficient than using OpenCL and the GPU cores. However, as the workload and number of concurrent threads increased, the OpenCL and GPU proved to be a significantly better solution.
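
    The crossover described above can be captured with a toy cost model: the GPU pays a fixed offload overhead but then processes each work item faster. Every constant below is a made-up placeholder for illustration, not a measurement from the article.

```python
# Toy model of the CPU/GPU crossover. ALL constants are hypothetical
# placeholders chosen for illustration, not measured values.

CPU_COST_PER_ITEM = 4.0       # assumed microseconds per work item on the CPU
GPU_COST_PER_ITEM = 0.5       # assumed microseconds per work item on the GPU
GPU_OFFLOAD_OVERHEAD = 700.0  # assumed fixed cost of dispatching to the GPU


def cpu_time(items):
    """Modeled CPU time: cost grows linearly from zero."""
    return items * CPU_COST_PER_ITEM


def gpu_time(items):
    """Modeled GPU time: fixed overhead plus a smaller per-item cost."""
    return GPU_OFFLOAD_OVERHEAD + items * GPU_COST_PER_ITEM


def break_even_items():
    """Workload size above which the GPU is faster in this model."""
    return GPU_OFFLOAD_OVERHEAD / (CPU_COST_PER_ITEM - GPU_COST_PER_ITEM)


def choose_device(items):
    """Pick the device with the lower modeled time."""
    return "gpu" if gpu_time(items) < cpu_time(items) else "cpu"


if __name__ == "__main__":
    for items in (50, 200, 1000):
        print(items, choose_device(items))
```

    With these placeholder numbers the break-even point is 200 items; real values would have to be measured on the target hardware.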

    The beauty of the APU-based architecture is that it allows the designer to decide when and if to use the GPU resources and how.

  19. Tomi Engdahl says:

    Implementing FPGA Design with the OpenCL Standard

    Utilizing the Khronos Group’s OpenCL™ standard on an FPGA may offer significantly higher performance and at much lower power than is available today from hardware architectures such as CPUs, graphics processing units (GPUs), and digital signal processing (DSP) units. In addition, an FPGA-based heterogeneous system (CPU + FPGA) using the OpenCL standard has a significant time-to-market advantage compared to traditional FPGA development using lower level hardware description languages (HDLs) such as Verilog or VHDL.

  20. Tomi Engdahl says:

    GPU Compute and OpenCL: an introduction.

    This article provides, to the reader unfamiliar with the subject, an introduction to the GPU evolution, current architecture, and suitability for compute intensive applications.

  21. Tomi Engdahl says:

    The Most Under-rated FPGA Design Tool Ever

    There is a design tool that is being quietly adopted by FPGA engineers because, in many cases, it produces results that are better than hand-coded counterparts.

    FPGAs keep getting larger, the designs more complex, and the need for high level design (HLD) flows never seems to go away. C-based design for FPGAs has been promoted for over two decades and several such tools are currently on the market. Model-based design has also been around for a long time from multiple vendors. OpenCL for FPGAs has been getting lots of press in the last couple of years. Yet, despite all of this, 90+% of FPGA designs continue to be built using traditional Verilog or VHDL.

    No one can deny the need for HLD. New FPGAs contain over 1 million logic elements, with thousands of hardened DSP and memory blocks. Some vendors’ devices can even support floating-point as efficiently as fixed-point arithmetic. Data converter and interface protocols routinely run at multiple GSPS (giga samples per second), requiring highly parallel or vectorized processing. Timing closure, simulation, and verification become ever-more time-consuming as design sizes grow. But HLD adoption still lags, and FPGAs are primarily programmed by hardware-centric engineers using traditional hardware description languages (HDLs).

    The primary reason for this is quality of results (QoR). All high-level design tools have two key challenges to overcome. One is to translate the designer’s intent into implementation when the design is described in a high-level format. This is especially difficult when software programming languages are used (C++, MATLAB, or others), which are inherently serial in nature. It is then up to the compiler to decide by how much and where to parallelize the hardware implementation. This can be aided by adding special intrinsics into the design language, but this defeats the purpose. OpenCL addresses this by having the programmer describe serial dependencies in the datapath, which is why OpenCL is often used for programming GPUs. It is then up to the OpenCL compiler to decide how to balance parallelism against throughput in the implementation. However, OpenCL programming is not exactly a common skillset in the industry.

