Name: Basri Ananta
NPM: 52414043
Class: 4IA22
Group 6
1.1 Parallelism and
Computing
A parallel computer is
a set of processors that are able to work cooperatively to solve a
computational problem. This definition is broad enough to include parallel
supercomputers that have hundreds or thousands of processors, networks of
workstations, multiple-processor workstations, and embedded systems. Parallel
computers are interesting because they offer the potential to concentrate
computational resources—whether processors, memory, or I/O bandwidth—on important
computational problems.
Parallelism has
sometimes been viewed as a rare and exotic subarea of computing, interesting
but of little relevance to the average programmer. A study of trends in
applications, computer architecture, and networking shows that this view is no
longer tenable. Parallelism is becoming ubiquitous, and parallel programming is
becoming central to the programming enterprise.
1.1.1 Trends in
Applications
As computers become
ever faster, it can be tempting to suppose that they will eventually become “fast enough”
and that appetite for increased computing power will be sated. However, history
suggests that as a particular technology satisfies known applications, new
applications will arise that are enabled by that technology and that will
demand the development of new technology. As an amusing illustration of this
phenomenon, a report prepared for the British government in the late 1940s
concluded that Great Britain’s computational requirements could be met by two
or perhaps three computers. In those days, computers were used primarily for
computing ballistics tables. The authors of the report did not consider other
applications in science and engineering, let alone the commercial applications
that would soon come to dominate computing. Similarly, the initial prospectus
for Cray Research predicted a market for ten supercomputers; many hundreds have
since been sold.
Traditionally,
developments at the high end of computing have been motivated by numerical
simulations of complex systems such as weather, climate, mechanical devices, electronic
circuits, manufacturing processes, and
chemical reactions. However, the most significant forces driving the
development of faster computers today are emerging commercial applications that
require a computer to be able to process large amounts of data in sophisticated
ways. These applications include video
conferencing, collaborative work environments,
computer-aided diagnosis in medicine, parallel databases used for decision support, and advanced graphics and
virtual reality, particularly in the
entertainment industry. For example, the integration of parallel computation,
high-performance networking, and multimedia technologies is leading to the
development of video servers, computers
designed to serve hundreds or thousands of simultaneous requests for real-time
video. Each video stream can involve both data transfer rates of many megabytes
per second and large amounts of processing for encoding and decoding. In
graphics, three-dimensional data sets are now approaching 10^9 volume elements (1024 on a side). At 200 operations per element, a display updated 30 times per second requires a computer capable of 6.4 x 10^12 operations per second.
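As a quick arithmetic check of that figure, using only the numbers stated above:

1024^{3} \times 200 \times 30 \;\approx\; 1.07\times 10^{9} \times 6000 \;\approx\; 6.4\times 10^{12} \ \text{operations per second}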
Although commercial
applications may define the architecture of most future parallel computers,
traditional scientific applications will remain important users of parallel
computing technology. Indeed, as nonlinear effects place limits on the insights
offered by purely theoretical investigations and as experimentation becomes
more costly or impractical, computational studies of complex systems are
becoming ever more important. Computational costs typically increase as the
fourth power or more of the “resolution” that determines accuracy, so these
studies have a seemingly insatiable demand for more computer power. They are
also often characterized by large memory and input/output requirements. For
example, a ten-year simulation of the earth’s climate using a state-of-the-art
model may involve 10^16 floating-point operations—ten days at an execution speed of 10^10 floating-point operations per second (10 gigaflops). This same simulation can easily generate a hundred gigabytes (10^11 bytes) or more of data.
Yet as Table 1.1 shows, scientists can easily imagine refinements to these
models that would increase these computational requirements 10,000 times.
Table 1.1: Various
refinements proposed to climate models, and the increased computational
requirements associated with these refinements. Altogether, these refinements
could increase computational requirements by a factor of 10^4 or more.
In summary, the need
for faster computers is driven by the demands of both data-intensive
applications in commerce and computation-intensive applications in science and
engineering. Increasingly, the requirements of these fields are merging, as
scientific and engineering applications become more data intensive and
commercial applications perform more sophisticated computations.
1.1.2 Trends in
Computer Design
The performance of the
fastest computers has grown exponentially from
1945 to the present, averaging a factor of 10 every five years. While
the first computers performed a few tens of floating-point operations per second, the parallel computers
of the mid-1990s achieve tens of billions of operations per second (Figure
1.1). Similar trends can be observed in the low-end computers of different
eras: the calculators, personal computers, and workstations. There is little to
suggest that this growth will not continue. However, the computer architectures
used to sustain this growth are changing radically—from sequential to parallel.
Figure 1.1: Peak
performance of some of the fastest supercomputers, 1945–1995. The exponential
growth flattened off somewhat in the 1980s but is accelerating again as
massively parallel supercomputers become available. Here, “o” denotes uniprocessors, “+” denotes modestly parallel vector computers with 4–16
processors, and “x” denotes massively parallel computers with hundreds or
thousands of processors. Typically, massively parallel computers achieve a
lower proportion of their peak performance on realistic applications than do
vector computers.
The performance of a
computer depends directly on the time required to perform a basic operation and
the number of these basic operations
that can be performed concurrently. The time to perform a basic operation is ultimately limited by the “clock
cycle” of the processor, that is, the time required to perform the most
primitive operation. However, clock cycle times are decreasing slowly and
appear to be approaching physical limits such as the speed of light (Figure
1.2). We cannot depend on faster processors to provide increased computational
performance.
Figure 1.2: Trends in
computer clock cycle times. Conventional vector supercomputer cycle times
(denoted “o”) have decreased only by a factor of 3 in sixteen years, from the
CRAY-1 (12.5 nanoseconds) to the C90 (4.0). RISC microprocessors (denoted “+”)
are fast approaching the same performance. Both architectures appear to be
approaching physical limits.
To circumvent these
limitations, the designer may attempt to utilize internal concurrency in a chip,
for example, by operating simultaneously on all 64 bits of two numbers that are
to be multiplied. However, a fundamental result in Very Large Scale Integration (VLSI) complexity theory says that this strategy is expensive. This result states that for certain transitive computations (in which any output may depend on any input), the chip area A and the time T required to perform the computation are related so that A·T^2 must exceed some problem-dependent function of problem size. This result can be explained informally by assuming that a computation must move a certain amount of information from one side of a square chip to the other. The amount of information that can be moved in a time unit is limited by the cross section of the chip, which is proportional to √A. This gives a transfer rate proportional to √A, from which the A·T^2 relation is obtained. To decrease the time required to move the information by a certain factor, the cross section must be increased by the same factor, and hence the total area must be increased by the square of that factor.
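Stated slightly more formally (a restatement of the informal argument above, with I denoting the amount of information that must cross the chip and constants of proportionality dropped):

\text{transfer rate} \propto \sqrt{A} \;\Longrightarrow\; T \ge \frac{I}{\sqrt{A}} \;\Longrightarrow\; A\,T^{2} \ge I^{2}

and since I grows with the problem size s, A·T^2 must exceed some problem-dependent function f(s).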
This result means that not only is it difficult to
build individual components that operate faster, it may not even be desirable
to do so. It may be cheaper to use more, slower components. For example, if we have an area n^2·A of silicon to use in a computer, we can either build n^2 components, each of size A and able to perform an operation in time T, or build a single component able to perform the same operation in time T/n. The multicomponent system is potentially n times faster.
Computer designers use
a variety of techniques to overcome these
limitations on single computer performance, including pipelining
(different stages of several instructions execute concurrently) and multiple
function units (several multipliers, adders, etc., are controlled by a single
instruction stream). Increasingly, designers are incorporating multiple
“computers,” each with its own processor, memory, and associated
interconnection logic. This approach is
facilitated by advances in VLSI technology that continue to decrease the
number of components required to implement a computer. As the cost of a
computer is (very approximately) proportional to the number of components that
it contains, increased integration also increases the number of processors that
can be included in a computer for a particular cost. The result is continued
growth in processor counts (Figure 1.3).
Figure 1.3: Number of
processors in massively parallel computers (“o”) and vector multiprocessors
(“+”). In both cases, a steady increase in processor count is apparent. A
similar trend is starting to occur in workstations, and personal computers can
be expected to follow the same trend.
1.1.3 Trends in
Networking
Another important trend changing the face of
computing is an enormous increase in
the capabilities of the networks that connect computers. Not long ago,
high-speed networks ran at 1.5 Mbits per second; by the end of the 1990s,
bandwidths in excess of 1000 Mbits per second will be commonplace. Significant
improvements in reliability are also expected. These trends make it feasible to
develop applications that use physically distributed resources as if they were
part of the same computer. A typical application of this sort may utilize
processors on multiple remote computers, access a selection of remote
databases, perform rendering on one or more graphics computers, and provide
real-time output and control on a workstation.
We emphasize that computing on networked
computers (“distributed computing”) is not just a subfield of parallel
computing. Distributed computing is deeply concerned with problems such as
reliability, security, and heterogeneity that are generally regarded as
tangential in parallel computing. (As Leslie Lamport has observed, “A
distributed system is one in which the failure of a computer you didn’t even
know existed can render your own computer unusable.”) Yet the basic task of
developing programs that can run on many computers at once is a parallel
computing problem. In this respect, the previously distinct worlds of parallel
and distributed computing are converging.
1.1.4 Summary of Trends
This brief survey of
trends in applications, computer architecture, and networking suggests a future
in which parallelism pervades not only supercomputers but also workstations,
personal computers, and networks. In this future, programs will be required to
exploit the multiple processors located inside each computer and the additional
processors available across a network. Because most existing algorithms are specialized for a single processor, this
situation implies a need for new
algorithms and program structures able to perform many operations at once.
Concurrency becomes a fundamental requirement for algorithms and programs.
This survey also
suggests a second fundamental lesson. It appears likely that processor counts
will continue to increase—perhaps, as they do in some environments at present,
by doubling each year or two. Hence, software systems can be expected to
experience substantial increases in processor count over their lifetime. In
this environment, scalability
—resilience to increasing processor
counts—is as important as portability for protecting software investments. A
program able to use only a fixed number of processors is a bad program, as is a
program able to execute on only a single computer. Scalability is a major theme
that will be stressed throughout this book.
2.2 Distributed Concept
Distributed computing
is a field of computer science that studies distributed systems. A distributed
system is a model in which components located on networked computers
communicate and coordinate their actions by passing messages. The components
interact with each other in order to achieve a common goal. Three significant
characteristics of distributed systems are: concurrency of components, lack of
a global clock, and independent failure of components. Examples of distributed
systems vary from SOA-based systems to massively multiplayer online games to
peer-to-peer applications.
A computer program that
runs in a distributed system is called a distributed program, and distributed
programming is the process of writing such programs. There are many
alternatives for the message passing mechanism, including pure HTTP, RPC-like
connectors and message queues.
A goal and challenge
pursued by some computer scientists and practitioners in distributed systems is
location transparency; however, this goal has fallen out of favour in industry,
as distributed systems are different from conventional non-distributed systems,
and the differences, such as network partitions, partial system failures, and
partial upgrades, cannot simply be “papered over” by attempts at “transparency”
(see CAP theorem).
Distributed computing
also refers to the use of distributed systems to solve computational problems.
In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, which communicate with each other by message passing.
Distributed systems are
groups of networked computers, which have the same goal for their work. The
terms “concurrent computing”, “parallel computing”, and “distributed computing”
have a lot of overlap, and no clear distinction exists between them. The same
system may be characterized both as “parallel” and “distributed”; the
processors in a typical distributed system run concurrently in parallel.
Parallel computing may be seen as a particular tightly coupled form of
distributed computing, and distributed computing may be seen as a loosely
coupled form of parallel computing. Nevertheless, it is possible to roughly
classify concurrent systems as “parallel” or “distributed” using the following
criteria:
In parallel computing,
all processors may have access to a shared memory to exchange information
between processors. In distributed
computing, each processor has its own private memory (distributed memory).
Information is exchanged by passing messages between the processors.
The standard illustration of the difference between distributed and parallel systems has three parts. Figure (a) is a schematic view of a typical distributed system; the system is represented
as a network topology in which each node is a computer and each line connecting
the nodes is a communication link. Figure (b) shows the same distributed system
in more detail: each computer has its own local memory, and information can be
exchanged only by passing messages from one node to another by using the
available communication links. Figure (c) shows a parallel system in which each
processor has a direct access to a shared memory.
The situation is
further complicated by the traditional uses of the terms parallel and
distributed algorithm that do not quite match the above definitions of parallel
and distributed systems (see below for more detailed discussion). Nevertheless,
as a rule of thumb, high-performance parallel computation in a shared-memory
multiprocessor uses parallel algorithms while the coordination of a large-scale
distributed system uses distributed algorithms.
3.3 Parallel Computer Architecture
Parallel processing has
been developed as an effective technology in modern computers to meet the
demand for higher performance, lower cost and accurate results in real-life
applications. Concurrent events are common in today’s computers due to the
practice of multiprogramming, multiprocessing, or multicomputing.
Modern computers have
powerful and extensive software packages. To analyze the development of the
performance of computers, first we have to understand the basic development of
hardware and software.
Computer Development Milestones − There are two major stages in the development of computers: mechanical or electromechanical machines, and modern electronic computers, which evolved after the introduction of electronic components. High-mobility electrons in electronic computers replaced the operational parts of mechanical computers, and for information transmission, electric signals, which travel at almost the speed of light, replaced mechanical gears and levers.
Elements of Modern
computers − A modern computer system consists of computer hardware, instruction
sets, application programs, system software and user interface.
The computing problems
are categorized as numerical computing, logical reasoning, and transaction
processing. Some complex problems may need a combination of all three processing modes.
Evolution of Computer
Architecture − In the last four decades, computer architecture has gone through revolutionary changes. We started with the Von Neumann architecture, and now we have multicomputers and multiprocessors.
Performance of a computer system − The performance of a computer system depends on both machine capability and program behavior. Machine capability can be improved with better hardware technology, advanced architectural features and efficient resource management. Program behavior is unpredictable, as it depends on the application and run-time conditions.
Multiprocessors and
Multicomputers
In this section, we
will discuss two types of parallel computers −
Multiprocessors
Multicomputers
Shared-Memory Multiprocessors
The three most common shared-memory multiprocessor models are −
Uniform Memory Access (UMA)
In this model, all the processors share the physical memory uniformly. All the processors have equal access time to all the memory words. Each processor may have a private cache memory. The same rule applies to peripheral devices.
When all the processors
have equal access to all the peripheral devices, the system is called a
symmetric multiprocessor. When only one or a few processors can access the
peripheral devices, the system is called an asymmetric multiprocessor.
Non-uniform Memory Access (NUMA)
In the NUMA multiprocessor model, the access time varies with the location of the memory word. Here, the
shared memory is physically distributed among all the processors, called local
memories. The collection of all local memories forms a global address space
which can be accessed by all the processors.
Cache-Only Memory Architecture (COMA)
The COMA model is a
special case of the NUMA model. Here, all the distributed main memories are
converted to cache memories.
Distributed-Memory Multicomputers − A distributed-memory multicomputer system consists of multiple computers, known as nodes, interconnected by a message passing network. Each node acts as an autonomous computer having a processor, a local memory and sometimes I/O devices. In this case, all local memories are private and are accessible only to the local processors. This is why such machines are traditionally called no-remote-memory-access (NORMA) machines.
Multivector and SIMD
Computers
In this section, we
will discuss supercomputers and parallel processors for vector processing and
data parallelism.
Vector Supercomputers
In a vector computer, a
vector processor is attached to the scalar processor as an optional feature.
The host computer first loads the program and data into main memory. Then the
scalar control unit decodes all the instructions. If the decoded instructions are
scalar operations or program operations, the scalar processor executes those
operations using scalar functional pipelines.
On the other hand, if the decoded instructions are vector operations, then the instructions are sent to the vector control unit.
SIMD Supercomputers
In SIMD computers, N processors are connected to a control unit, and all the processors
have their individual memory units. All the processors are connected by an
interconnection network.
PRAM and VLSI Models
The ideal model gives a
suitable framework for developing parallel algorithms without considering the
physical constraints or implementation details.
These models can be used to obtain theoretical performance bounds on parallel computers or to evaluate VLSI complexity in terms of chip area and operation time before the chip is fabricated.
Parallel Random-Access
Machines
Shepherdson and Sturgis (1963) modeled conventional uniprocessor computers as random-access machines (RAM). Fortune and Wyllie (1978) developed a parallel random-access machine (PRAM) model for modeling an idealized parallel computer with zero memory access overhead and synchronization.
An N-processor PRAM has
a shared memory unit. This shared memory can be centralized or distributed
among the processors. These processors operate on a synchronized read-memory,
write-memory and compute cycle. So, these models specify how concurrent read and
write operations are handled.
Following are the possible memory update operations (a small sketch of conflict resolution follows the list) −
Exclusive read (ER) − In this method, in each cycle only one processor is allowed to read from any given memory location.
Exclusive write (EW) − In this method, only one processor is allowed to write into a given memory location at a time.
Concurrent read (CR) − It allows multiple processors to read the same information from the same memory location in the same cycle.
Concurrent write (CW) − It allows simultaneous write operations to the same memory location. To avoid write conflicts, an arbitration policy is set up.
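To make the concurrent-write case concrete, here is a minimal, hypothetical C++ sketch of one synchronized PRAM write phase in which CW conflicts are resolved by a priority policy (the lowest-numbered processor wins). The type and function names are illustrative only, not part of any standard PRAM library.

#include <cstdio>
#include <map>
#include <vector>

// One synchronized PRAM write phase: every processor posts (address, value)
// during the cycle, then all surviving writes land at once. Concurrent writes
// to the same address are resolved by priority: lowest processor index wins.
struct WriteRequest { int proc; std::size_t addr; int value; };

void applyWritePhase(std::vector<int>& memory,
                     const std::vector<WriteRequest>& requests) {
    std::map<std::size_t, WriteRequest> winner;   // address -> winning request
    for (const WriteRequest& r : requests) {
        auto it = winner.find(r.addr);
        if (it == winner.end() || r.proc < it->second.proc)
            winner[r.addr] = r;                   // lower index has priority
    }
    for (const auto& kv : winner)
        memory[kv.first] = kv.second.value;       // writes applied "simultaneously"
}

int main() {
    std::vector<int> memory(4, 0);
    // Processors 0, 1 and 2 all write to address 2 in the same cycle (a CW conflict).
    std::vector<WriteRequest> cycle = { {2, 2, 30}, {0, 2, 10}, {1, 2, 20}, {3, 0, 99} };
    applyWritePhase(memory, cycle);
    std::printf("mem[0]=%d mem[2]=%d\n", memory[0], memory[2]);  // mem[0]=99 mem[2]=10
    return 0;
}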
VLSI Complexity Model
Parallel computers use
VLSI chips to fabricate processor arrays, memory arrays and large-scale
switching networks.
Nowadays, VLSI
technologies are 2-dimensional. The size of a VLSI chip is proportional to the amount of storage (memory) space available in that chip.
We can calculate the space complexity of an algorithm by the chip area, A, of the VLSI chip implementation of that algorithm. If T is the latency (time) needed to execute the algorithm, then A·T gives an upper bound on the total number of bits processed through the chip (or I/O). For certain computations, there exists a lower bound f(s) such that

A·T^2 >= O(f(s))

where A = chip area and T = time.
Architectural
Development Tracks
The evolution of parallel computers has spread along the following tracks −
Multiple processor track − multiprocessor track and multicomputer track
Multiple data track − vector track and SIMD track
Multiple threads track − multithreaded track and dataflow track
In the multiple processor track, it is assumed that different threads execute concurrently on different processors and communicate through a shared memory (multiprocessor track) or a message passing (multicomputer track) system.
In the multiple data track, it is assumed that the same code is executed on a massive amount of data. This is done either by executing the same instructions on a sequence of data elements (vector track) or through the execution of the same sequence of instructions on a similar set of data (SIMD track).
In the multiple threads track, it is assumed that various threads are executed in interleaved fashion on the same processor, so as to hide synchronization delays among threads executing on different processors. Thread interleaving can be coarse (multithreaded track) or fine (dataflow track).
4.4 Threaded Programming
In computer science, a
thread of execution is the smallest sequence of programmed instructions that
can be managed independently by a scheduler, which is typically a part of the
operating system.[1] The implementation of threads and processes differs between
operating systems, but in most cases a thread is a component of a process.
Multiple threads can exist within one process, executing concurrently and
sharing resources such as memory, while different processes do not share these
resources. In particular, the threads of a process share its executable code
and the values of its variables at any given time.
Systems with a single
processor generally implement multithreading by time slicing: the central
processing unit (CPU) switches between different software threads. This context
switching generally happens very often and rapidly enough that users perceive
the threads or tasks as running in parallel. On a multiprocessor or multi-core
system, multiple threads can execute in parallel, with every processor or core
executing a separate thread simultaneously; on a processor or core with
hardware threads, separate software threads can also be executed concurrently
by separate hardware threads.
Threads differ from
traditional multitasking operating system processes in that:
processes are typically
independent, while threads exist as subsets of a process
processes carry
considerably more state information than threads, whereas multiple threads
within a process share process state as well as memory and other resources
processes have separate
address spaces, whereas threads share their address space
processes interact only
through system-provided inter-process communication mechanisms
context switching
between threads in the same process is typically faster than context switching
between processes.
Systems such as Windows
NT and OS/2 are said to have cheap threads and expensive processes; in other
operating systems there is not so great a difference except the cost of an
address space switch which on some architectures (notably x86) results in a
translation lookaside buffer (TLB) flush.
Multithreading is
mainly found in multitasking operating systems. Multithreading is a widespread
programming and execution model that allows multiple threads to exist within
the context of one process. These threads share the process’s resources, but
are able to execute independently. The threaded programming model provides
developers with a useful abstraction of concurrent execution. Multithreading
can also be applied to one process to enable parallel execution on a
multiprocessing system.
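As a minimal illustration of this model, the following C++ sketch (using std::thread from the standard library) runs two threads inside one process; both see the same shared vector, so no inter-process communication is needed. The function and variable names are just for this example.

#include <iostream>
#include <thread>
#include <vector>

// Both threads belong to the same process and therefore see the same vector.
static std::vector<long long> partialSums(2, 0);

void sumRange(int id, long long lo, long long hi) {
    long long s = 0;
    for (long long i = lo; i < hi; ++i) s += i;
    partialSums[id] = s;              // each thread writes only its own slot: no race
}

int main() {
    std::thread t0(sumRange, 0, 0LL, 500000LL);       // two threads of one process,
    std::thread t1(sumRange, 1, 500000LL, 1000000LL); // scheduled by the operating system
    t0.join();                        // wait for both threads to finish
    t1.join();
    std::cout << "sum = " << partialSums[0] + partialSums[1] << "\n";
    return 0;
}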
Multithreaded
applications have the following advantages:
Responsiveness:
multithreading can allow an application to remain responsive to input. In a
one-thread program, if the main execution thread blocks on a long-running task,
the entire application can appear to freeze. By moving such long-running tasks
to a worker thread that runs concurrently with the main execution thread, it is
possible for the application to remain responsive to user input while executing
tasks in the background. On the other hand, in most cases multithreading is not
the only way to keep a program responsive, with non-blocking I/O and/or Unix
signals being available for gaining similar results.[6]
Faster execution: a multithreaded program can operate faster on computer systems that have multiple central processing units (CPUs) or one or more multi-core processors, or across a cluster of machines, because the threads of
the program naturally lend themselves to parallel execution, assuming
sufficient independence (that they do not need to wait for each other).
Lower resource
consumption: using threads, an application can serve multiple clients
concurrently using fewer resources than it would need when using multiple
process copies of itself. For example, the Apache HTTP server uses thread
pools: a pool of listener threads for listening to incoming requests, and a
pool of server threads for processing those requests (a minimal thread-pool sketch appears after this list).
Better system
utilization: as an example, a file system using multiple threads can achieve
higher throughput and lower latency since data in a faster medium (such as
cache memory) can be retrieved by one thread while another thread retrieves
data from a slower medium (such as external storage) with neither thread
waiting for the other to finish.
Simplified sharing and
communication: unlike processes, which require a message passing or shared
memory mechanism to perform inter-process communication (IPC), threads can
communicate through data, code and files they already share.
Parallelization:
applications looking to use multicore or multi-CPU systems can use
multithreading to split data and tasks into parallel subtasks and let the
underlying architecture manage how the threads run, either concurrently on one
core or in parallel on multiple cores. GPU computing environments like CUDA and
OpenCL use the multithreading model where dozens to hundreds of threads run in
parallel across data on a large number of cores.
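Returning to the "lower resource consumption" item above, the following is a minimal C++ sketch of the thread-pool idea: a fixed set of worker threads serves many submitted tasks from one shared queue. It is illustrative only and is not Apache's implementation.

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// A minimal fixed-size thread pool: workers pull tasks from one shared queue.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();                              // run the task outside the lock
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    ThreadPool pool(4);                          // four workers serve many "requests"
    for (int i = 0; i < 8; ++i)
        pool.submit([i] { std::cout << ("handled request " + std::to_string(i) + "\n"); });
    // The ThreadPool destructor drains the queue and joins the workers on scope exit.
}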
Multithreading has the
following drawbacks:
Synchronization: since
threads share the same address space, the programmer must be careful to avoid
race conditions and other non-intuitive behaviors. In order for data to be
correctly manipulated, threads will often need to rendezvous in time in order
to process the data in the correct order. Threads may also require mutually
exclusive operations (often implemented using mutexes) in order to prevent
common data from being simultaneously modified or read while in the process of
being modified. Careless use of such primitives can lead to deadlocks,
livelocks or races over a resource (a minimal mutex sketch appears after this list).
Thread crashes a
process: an illegal operation performed by a thread crashes the entire process;
therefore, one misbehaving thread can disrupt the processing of all the other
threads in the application.
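The following minimal C++ sketch illustrates the synchronization drawback described above: four threads increment a shared counter, first without and then with a mutex. The unsynchronized version is a data race (formally undefined behavior) and is shown only to make the hazard visible; the counts are arbitrary.

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

long counter = 0;              // shared data: a very simple "data structure"
std::mutex counterMutex;       // protects counter against concurrent updates

void incrementMany(bool useLock, int times) {
    for (int i = 0; i < times; ++i) {
        if (useLock) {
            std::lock_guard<std::mutex> guard(counterMutex);
            ++counter;         // read-modify-write done while holding the lock
        } else {
            ++counter;         // unsynchronized read-modify-write: a race condition
        }
    }
}

int main() {
    for (bool useLock : {false, true}) {
        counter = 0;
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back(incrementMany, useLock, 100000);
        for (auto& th : threads) th.join();
        std::cout << (useLock ? "with mutex:    " : "without mutex: ")
                  << counter << " (expected 400000)\n";
    }
    return 0;
}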
Scheduling can be done
at the kernel level or user level, and multitasking can be done preemptively or
cooperatively. This yields a variety of related concepts.
At the kernel level, a
process contains one or more kernel threads, which share the process’s
resources, such as memory and file handles – a process is a unit of resources,
while a thread is a unit of scheduling and execution. Kernel scheduling is
typically uniformly done preemptively or, less commonly, cooperatively. At the
user level a process such as a runtime system can itself schedule multiple
threads of execution. If these do not share data, as in Erlang, they are
usually analogously called processes,[7] while if they share data they are
usually called (user) threads, particularly if preemptively scheduled.
Cooperatively scheduled user threads are known as fibers; different processes
may schedule user threads differently. User threads may be executed by kernel
threads in various ways (one-to-one, many-to-one, many-to-many). The term
“light-weight process” variously refers to user threads or to kernel mechanisms
for scheduling user threads onto kernel threads.
A process is a
“heavyweight” unit of kernel scheduling, as creating, destroying, and switching
processes is relatively expensive. Processes own resources allocated by the
operating system. Resources include memory (for both code and data), file
handles, sockets, device handles, windows, and a process control block.
Processes are isolated by process isolation, and do not share address spaces or
file resources except through explicit methods such as inheriting file handles
or shared memory segments, or mapping the same file in a shared way – see interprocess
communication. Creating or destroying a process is relatively expensive, as
resources must be acquired or released. Processes are typically preemptively
multitasked, and process switching is relatively expensive, beyond basic cost
of context switching, due to issues such as cache flushing.[a]
A kernel thread is a
“lightweight” unit of kernel scheduling. At least one kernel thread exists
within each process. If multiple kernel threads exist within a process, then
they share the same memory and file resources. Kernel threads are preemptively
multitasked if the operating system’s process scheduler is preemptive. Kernel
threads do not own resources except for a stack, a copy of the registers
including the program counter, and thread-local storage (if any), and are thus
relatively cheap to create and destroy. Thread switching is also relatively
cheap: it requires a context switch (saving and restoring registers and stack
pointer), but does not change virtual memory and is thus cache-friendly
(leaving TLB valid). The kernel can assign one thread to each logical core in a
system (because each processor splits itself up into multiple logical cores if
it supports multithreading, or only supports one logical core per physical core
if it does not), and can swap out threads that get blocked. However, kernel
threads take much longer than user threads to be swapped.
Threads are sometimes
implemented in userspace libraries, thus called user threads. The kernel is
unaware of them, so they are managed and scheduled in userspace. Some
implementations base their user threads on top of several kernel threads, to
benefit from multi-processor machines (M:N model). In this article the term
“thread” (without kernel or user qualifier) defaults to referring to kernel
threads. User threads as implemented by virtual machines are also called green
threads. User threads are generally fast to create and manage, but cannot take
advantage of multithreading or multiprocessing, and will get blocked if all of
their associated kernel threads get blocked even if there are some user threads
that are ready to run.
Fibers are an even
lighter unit of scheduling which are cooperatively scheduled: a running fiber
must explicitly “yield” to allow another fiber to run, which makes their
implementation much easier than kernel or user threads. A fiber can be
scheduled to run in any thread in the same process. This permits applications
to gain performance improvements by managing scheduling themselves, instead of
relying on the kernel scheduler (which may not be tuned for the application).
Parallel programming environments such as OpenMP typically implement their
tasks through fibers. Closely related to fibers are coroutines, with the
distinction being that coroutines are a language-level construct, while fibers
are a system-level construct.
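As a small illustration of task-style parallelism in this vein, the following C++ sketch uses OpenMP tasks (compile with a flag such as -fopenmp). Whether a particular OpenMP runtime implements tasks with fibers, user threads, or kernel threads is an implementation detail, so this shows only the programming model.

#include <cstdio>
#include <omp.h>

// Naive recursive Fibonacci, parallelized with OpenMP tasks.
long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);            // this call may run as a separate task
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait       // wait for both child tasks to finish
    return a + b;
}

int main() {
    long result = 0;
    #pragma omp parallel       // create a team of threads
    #pragma omp single nowait  // one thread spawns the root task; the others execute tasks
    result = fib(20);
    std::printf("fib(20) = %ld\n", result);
    return 0;
}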
Thread and fiber issues
Concurrency and data
structures
Threads in the same
process share the same address space. This allows concurrently running code to
couple tightly and conveniently exchange data without the overhead or
complexity of an IPC. When shared between threads, however, even simple data
structures become prone to race conditions if they require more than one CPU
instruction to update: two threads may end up attempting to update the data
structure at the same time and find it unexpectedly changing underfoot. Bugs
caused by race conditions can be very difficult to reproduce and isolate.
To prevent this,
threading application programming interfaces (APIs) offer synchronization
primitives such as mutexes to lock data structures against concurrent access.
On uniprocessor systems, a thread running into a locked mutex must sleep and
hence trigger a context switch. On multi-processor systems, the thread may
instead poll the mutex in a spinlock. Both of these may sap performance and
force processors in symmetric multiprocessing (SMP) systems to contend for the
memory bus, especially if the granularity of the locking is fine.
Although threads seem
to be a small step from sequential computation, in fact, they represent a huge
step. They discard the most essential and appealing properties of sequential
computation: understandability, predictability, and determinism. Threads, as a
model of computation, are wildly non-deterministic, and the job of the
programmer becomes one of pruning that nondeterminism.
— The Problem with
Threads, Edward A. Lee, UC Berkeley, 2006[8]
I/O and scheduling
User thread or fiber
implementations are typically entirely in userspace. As a result, context
switching between user threads or fibers within the same process is extremely
efficient because it does not require any interaction with the kernel at all: a
context switch can be performed by locally saving the CPU registers used by the
currently executing user thread or fiber and then loading the registers
required by the user thread or fiber to be executed. Since scheduling occurs in
userspace, the scheduling policy can be more easily tailored to the requirements
of the program’s workload.
However, the use of
blocking system calls in user threads (as opposed to kernel threads) or fibers
can be problematic. If a user thread or a fiber performs a system call that
blocks, the other user threads and fibers in the process are unable to run
until the system call returns. A typical example of this problem is when
performing I/O: most programs are written to perform I/O synchronously. When an
I/O operation is initiated, a system call is made, and does not return until
the I/O operation has been completed. In the intervening period, the entire
process is “blocked” by the kernel and cannot run, which starves other user
threads and fibers in the same process from executing.
A common solution to
this problem is providing an I/O API that implements a synchronous interface by
using non-blocking I/O internally, and scheduling another user thread or fiber
while the I/O operation is in progress. Similar solutions can be provided for
other blocking system calls. Alternatively, the program can be written to avoid
the use of synchronous I/O or other blocking system calls.
SunOS 4.x implemented
light-weight processes or LWPs. NetBSD 2.x+, and DragonFly BSD implement LWPs
as kernel threads (1:1 model). SunOS 5.2 through SunOS 5.8 as well as NetBSD 2
to NetBSD 4 implemented a two level model, multiplexing one or more user level
threads on each kernel thread (M:N model). SunOS 5.9 and later, as well as
NetBSD 5 eliminated user threads support, returning to a 1:1 model.[9] FreeBSD
5 implemented M:N model. FreeBSD 6 supported both 1:1 and M:N, users could
choose which one should be used with a given program using /etc/libmap.conf.
Starting with FreeBSD 7, the 1:1 became the default. FreeBSD 8 no longer
supports the M:N model.
The use of kernel
threads simplifies user code by moving some of the most complex aspects of
threading into the kernel. The program does not need to schedule threads or
explicitly yield the processor. User code can be written in a familiar
procedural style, including calls to blocking APIs, without starving other
threads. However, kernel threading may force a context switch between threads
at any time, and thus expose race hazards and concurrency bugs that would
otherwise lie latent. On SMP systems, this is further exacerbated because
kernel threads may literally execute on separate processors in parallel.
5.5 Message Passing
In computer science,
message passing is a technique for invoking behavior (i.e., running a program)
on a computer. The invoking program sends a message to a process (which may be
an actor or object) and relies on the process and the supporting infrastructure
to select and invoke the actual code to run. Message passing differs from
conventional programming where a process, subroutine, or function is directly
invoked by name. Message passing is key to some models of concurrency and object-oriented
programming.
Message passing is used
ubiquitously in modern computer software. It is used as a way for the objects
that make up a program to work with each other and as a means for objects and
systems running on different computers (e.g., the Internet) to interact.
Message passing may be implemented by various mechanisms, including channels.
Message passing is a
technique for invoking behavior (i.e., running a program) on a computer. In
contrast to the traditional technique of calling a program by name, message
passing uses an object model to distinguish the general function from the
specific implementations. The invoking program sends a message and relies on
the object to select and execute the appropriate code. The justifications for
using an intermediate layer essentially fall into two categories:
encapsulation and distribution.
Encapsulation is the
idea that software objects should be able to invoke services on other objects
without knowing or caring about how those services are implemented. Encapsulation
can reduce the amount of coding logic and make systems more maintainable. For example, rather than having IF-THEN statements that determine which subroutine or function to call, a developer can just send a message to the object, and the object will select the appropriate code based on its type.
One of the first
examples of how this can be used was in the domain of computer graphics. There
are all sorts of complexities involved in manipulating graphic objects. For
example, the right formula to compute the area of an enclosed shape varies depending on whether the shape is a triangle, rectangle, ellipse, or
circle. In traditional computer programming this would result in long IF-THEN
statements testing what sort of object the shape was and calling the
appropriate code. The object-oriented way to handle this is to define a class
called Shape with subclasses such as Rectangle and Ellipse (which in turn have
subclasses Square and Circle) and then to simply send a message to any Shape
asking it to compute its area. Each Shape object will then invoke the
subclass’s method with the formula appropriate for that kind of object.[1]
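A minimal C++ sketch of this Shape example, in which "sending the message" area() lets each object select its own formula. The class names follow the description above, but the hierarchy is flattened for brevity (Circle derives directly from Shape) and none of this comes from a particular library.

#include <iostream>
#include <memory>
#include <vector>

// "Sending the message" area() to a Shape: the receiving object selects the code.
class Shape {
public:
    virtual ~Shape() = default;
    virtual double area() const = 0;          // each subclass supplies its own formula
};

class Rectangle : public Shape {
public:
    Rectangle(double w, double h) : w_(w), h_(h) {}
    double area() const override { return w_ * h_; }
private:
    double w_, h_;
};

class Circle : public Shape {
public:
    explicit Circle(double r) : r_(r) {}
    double area() const override { return 3.14159265358979 * r_ * r_; }
private:
    double r_;
};

int main() {
    std::vector<std::unique_ptr<Shape>> shapes;
    shapes.push_back(std::make_unique<Rectangle>(3.0, 4.0));
    shapes.push_back(std::make_unique<Circle>(1.0));

    // No IF-THEN chain on the object's type: the same message goes to every shape.
    for (const auto& s : shapes)
        std::cout << "area = " << s->area() << "\n";
    return 0;
}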
Distributed message
passing provides developers with a layer of the architecture that provides
common services to build systems made up of sub-systems that run on disparate
computers in different locations and at different times. When a distributed
object is sending a message, the messaging layer can take care of issues such
as:
Finding the app using
different operating systems and programming languages, at different locations
from where the message originated.
Saving the message on a
queue if the appropriate object to handle the message is not currently running
and then invoking the message when the object is available. Also, storing the
result if needed until the sending object is ready to receive it.
Controlling various
transactional requirements for distributed transactions, e.g. ACID-testing the
data.[2]
One of the most
important distinctions among message passing systems is whether they use
synchronous or asynchronous message passing. Synchronous message passing occurs
between objects that are running at the same time. With asynchronous message
passing it is possible for the receiving object to be busy or not running when
the requesting object sends the message.
Synchronous message
passing is what typical object-oriented programming languages such as Java and
Smalltalk use. Asynchronous message passing requires additional capabilities
for storing and retransmitting data for systems that may not run concurrently.
The advantage to
synchronous message passing is that it is conceptually less complex.
Synchronous message passing is analogous to a function call in which the
message sender is the function caller and the message receiver is the called
function. Function calling is easy and familiar. Just as the function caller
stops until the called function completes, the sending process stops until the
receiving process completes. This alone makes synchronous message passing unworkable for
some applications. For example, if synchronous message passing would be used
exclusively, large, distributed systems generally would not perform well enough
to be usable. Such large, distributed systems may need to continue to operate
while some of their subsystems are down; subsystems may need to go offline for
some kind of maintenance, or have times when subsystems are not open to receiving
input from other systems.
Imagine a busy business
office having 100 desktop computers that send emails to each other using
synchronous message passing exclusively. Because the office system does not use
asynchronous message passing, one worker turning off their computer can cause the
other 99 computers to freeze until the worker turns their computer back on to
process a single email.
Asynchronous message
passing is generally implemented so that all the complexities that naturally
occur when trying to synchronize systems and data are handled by an
intermediary level of software. Commercial vendors who develop software
products to support these intermediate levels usually call their software
“middleware”. One of the most common types of middleware to support
asynchronous messaging is called Message-oriented middleware (MOM).
With asynchronous
message passing, the sending system does not wait for a response. Continuing
the function call analogy, asynchronous message passing would be a function
call that returns immediately, without waiting for the called function to
execute. Such an asynchronous function call would merely deliver the arguments,
if any, to the called function, and tell the called function to execute, and
then return to continue its own execution. Asynchronous message passing simply
sends the message to the message bus. The bus stores the message until the
receiving process requests messages sent to it. When the receiving process
arrives at the result, it sends the result to the message bus, and the message
bus holds the message until the original process (or some designated next
process) picks up its messages from the message bus.[3]
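A minimal in-process C++ sketch of the one-way half of this asynchronous pattern: the sender enqueues a message on a "bus" and returns immediately, while the receiver picks messages up when it is ready. Real message-oriented middleware adds persistence, routing, replies, and delivery guarantees; the class and function names here are illustrative.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// A tiny "message bus": the sender enqueues and returns immediately;
// the receiver picks messages up whenever it is ready.
class MessageBus {
public:
    void send(std::string msg) {                       // asynchronous: does not wait
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
    }
    std::string receive() {                            // blocks until a message exists
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        std::string msg = std::move(q_.front());
        q_.pop();
        return msg;
    }
private:
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

int main() {
    MessageBus bus;

    std::thread receiver([&bus] {
        for (int i = 0; i < 3; ++i)
            std::cout << "received: " << bus.receive() << "\n";
    });

    // The sender fires off three messages and could go on doing other work.
    bus.send("msg 1");
    bus.send("msg 2");
    bus.send("msg 3");

    receiver.join();
    return 0;
}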
Synchronous
communication can be built on top of asynchronous communication by using a
Synchronizer. For example, the α-Synchronizer works by ensuring that the sender
always waits for an acknowledgement message from the receiver. The sender only
sends the next message after the acknowledgement has been received. On the
other hand, asynchronous communication can also be built on top of synchronous
communication. For example, modern microkernels generally only provide a
synchronous messaging primitive[citation needed] and asynchronous messaging can
be implemented on top by using helper threads.
The buffer required in
asynchronous communication can cause problems when it is full. A decision has
to be made whether to block the sender or whether to discard future messages.
If the sender is blocked, it may lead to an unexpected deadlock. If messages
are dropped, then communication is no longer reliable. These are all examples
of the kinds of problems that middleware vendors try to address.
6.6 CUDA GPU
CUDA is a parallel
computing platform and application programming interface (API) model created by
Nvidia.[1] It allows software developers and software engineers to use a
CUDA-enabled graphics processing unit (GPU) for general purpose processing – an
approach termed GPGPU (General-Purpose computing on Graphics Processing Units).
The CUDA platform is a software layer that gives direct access to the GPU’s
virtual instruction set and parallel computational elements, for the execution
of compute kernels.
The CUDA platform is
designed to work with programming languages such as C, C++, and Fortran. This
accessibility makes it easier for specialists in parallel programming to use
GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which
required advanced skills in graphics programming. Also, CUDA supports
programming frameworks such as OpenACC and OpenCL. When it was first introduced
by Nvidia, the name CUDA was an acronym for Compute Unified Device
Architecture, but Nvidia subsequently dropped the use of the acronym.
The graphics processing
unit (GPU), as a specialized computer processor, addresses the demands of
real-time high-resolution 3D graphics compute-intensive tasks. By 2012, GPUs
had evolved into highly parallel multi-core systems allowing very efficient
manipulation of large blocks of data. This design is more effective than
general-purpose central processing units (CPUs) for algorithms in situations
where processing large blocks of data is done in parallel, such as:
push-relabel maximum
flow algorithm
fast sort algorithms of
large lists
two-dimensional fast
wavelet transform
molecular dynamics
simulations
Programming abilities
Example of a CUDA processing flow (a minimal code sketch follows this list):
1. Copy data from main memory to GPU memory.
2. The CPU initiates the GPU compute kernel.
3. The GPU's CUDA cores execute the kernel in parallel.
4. Copy the resulting data from GPU memory back to main memory.
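The four steps above map directly onto the CUDA runtime API. The following is a minimal CUDA C++ vector-addition sketch (error checking omitted), compiled with nvcc; the array size and launch configuration are arbitrary.

#include <cstdio>
#include <cuda_runtime.h>

// Step 3: the kernel executed in parallel by the GPU's CUDA cores.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *hA = new float[n], *hB = new float[n], *hC = new float[n];
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    // Step 1: copy data from main memory to GPU memory.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 2: the CPU launches the compute kernel.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Step 4: copy the result from GPU memory back to main memory.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    std::printf("c[0] = %f\n", hC[0]);   // expected 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}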
The CUDA platform is
accessible to software developers through CUDA-accelerated libraries, compiler
directives such as OpenACC, and extensions to industry-standard programming
languages including C, C++ and Fortran. C/C++ programmers use ‘CUDA C/C++’, compiled
with nvcc, Nvidia’s LLVM-based C/C++ compiler.[4] Fortran programmers can use
‘CUDA Fortran’, compiled with the PGI CUDA Fortran compiler from The Portland
Group.
In addition to
libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform
supports other computational interfaces, including the Khronos Group’s
OpenCL,[5] Microsoft’s DirectCompute, OpenGL Compute Shaders and C++ AMP.[6]
Third party wrappers are also available for Python, Perl, Fortran, Java, Ruby,
Lua, Common Lisp, Haskell, R, MATLAB, IDL, and native support in Mathematica.
In the computer game
industry, GPUs are used for graphics rendering, and for game physics
calculations (physical effects such as debris, smoke, fire, fluids); examples
include PhysX and Bullet. CUDA has also been used to accelerate non-graphical
applications in computational biology, cryptography and other fields by an
order of magnitude or more.
CUDA provides both a
low level API and a higher level API. The initial CUDA SDK was made public on
15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later
added in version 2.0, which supersedes the beta released February 14, 2008.
CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce,
Quadro and the Tesla line. CUDA is compatible with most standard operating
systems. Nvidia states that programs developed for the G8x series will also
work without modification on all future Nvidia video cards, due to binary
compatibility.
CUDA 8.0 comes with the
following libraries (for compilation & runtime, in alphabetical order):
CUBLAS – CUDA Basic Linear Algebra Subroutines library
CUDART – CUDA Runtime library
CUFFT – CUDA Fast Fourier Transform library
CURAND – CUDA Random Number Generation library
CUSOLVER – CUDA-based collection of dense and sparse direct solvers
CUSPARSE – CUDA Sparse Matrix library
NPP – NVIDIA Performance Primitives library
NVGRAPH – NVIDIA Graph Analytics library
NVML – NVIDIA Management Library
NVRTC – NVIDIA Runtime Compilation library for CUDA C++
CUDA 8.0 comes with these other software components:
nView – NVIDIA nView Desktop Management Software
NVWMI – NVIDIA Enterprise Management Toolkit
PhysX – GameWorks PhysX, a multi-platform game physics engine
Sources:
https://www.tutorialspoint.com/parallel_computer_architecture/parallel_computer_architecture_models.htm
https://en.wikipedia.org/wiki/Thread_(computing)
http://www.mcs.anl.gov/~itf/dbpp/text/node7.html
https://en.wikipedia.org/wiki/CUDA
https://en.wikipedia.org/wiki/Message_passing
https://en.wikipedia.org/wiki/Distributed_computing