Wednesday, August 6, 2008

Preface to Algorithms

STUDENT REFERENCE NOTES FOR AN UNDERGRADUATE LEVEL COURSEWARE
ON ANALYSIS AND DESIGN OF ALGORITHMS
TCS-503
(UPTU syllabus-V Semester)
Department of Computer Science and Engineering at
Meerut Institute of Engineering and Technology and Meerut Institute of Technology, Meerut.

LETTER OF APPROVAL
TO WHOMSOEVER IT MAY CONCERN
This letter states that the reference notes drawn hereafter on the subject of 'Analysis and Design of Algorithms' by Amit Nigam, an undergraduate student of B.Tech, Computer Science and
Engineering at MIET, Meerut, have been thoroughly revised and approved by the Department of
Computer Science and Engineering and Information Technology at MIET, Meerut.
The approval attests to the correctness of the theory, algorithms, computations and notes prepared; however, it excludes printing and spelling errors (if any).
The department congratulates him and wishes him all the best.
Prof. Sunil V. K. Gaddam, HOD, Dept. of CSE, MIET.
Abhilash Jain, Sr. Lecturer, Dept. of CSE.

"A scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it." - Maxwell Planck
LET'S HAVE A TALK...
Like all software engineering processes, the study of analysis and design of algorithms is driven
by the need to efficiently solve the many practical computational problems and applications
encountered in almost all intra- and inter-disciplinary aspects of computer science and
engineering. The word 'efficiently' above deserves a few more lines here.
As a first note, algorithm efficiency means how much space (both auxiliary and immediate) and
computing time the program (algorithm) needs from the beginning of its execution to its
completion. The point to note is that both computing time and space in computer memory
are bounded resources. Our digital computers, however fast, are not infinitely fast, and storage,
however large and cheap, is neither unlimited nor free. Besides these performance constraints,
the software engineering process has its own limitations of design, documentation, data abstraction,
modularity, error handling, etc. An efficient algorithm thus would be one that takes care of all
these issues. A poorly conceived algorithm, on the other hand, can take much more space or may
terminate in a time that far exceeds expectations. Deciding which algorithm would be best
for a particular problem is the central theme of the design and analysis of algorithms.
Problem instances are not rare where different algorithms exist for a single problem. In such
cases, we can't stick to a particular solution; instead, a deeper analysis of the efficiency of all the
solutions, as well as a much deeper analysis of the problem itself, is required. For example, both
insertion sort and merge sort solve the same sorting problem over a sequence of n numbers.
Then which solution should we opt for? Let's see an instant answer.
Insertion sort uses an incremental approach while merge sort uses divide and conquer to sort the
given array. They are two different approaches.
Insertion sort takes worst-case time c1*n^2, i.e. O(n^2), while merge sort's worst-case upper bound is
c2*n*lg n, i.e. O(n lg n), where c1 and c2 are constants that do not depend on n.
Insertion sort usually has a smaller constant factor than merge sort, so c1 < c2, and for
significantly smaller inputs insertion sort is faster. But once the size of the input grows, merge
sort shows the advantage of its smaller 'lg n' factor against the much larger factor 'n' in insertion sort's running time. So at larger inputs merge sort dominates. This analysis of both of
them (problem as well as algorithm) reveals that it would be clever to use insertion sort within
merge sort so that when the subproblems grow small, the faster insertion sort is used.
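To make this concrete, here is a minimal Python sketch of such a hybrid sort. The function names and the cutoff value of 16 are my own choices for illustration, not part of the original notes; the best cutoff is machine dependent.

# A minimal hybrid sort sketch: merge sort that switches to insertion
# sort once a subarray is smaller than CUTOFF (an assumed value).
CUTOFF = 16

def insertion_sort(a, lo, hi):
    # Sort a[lo..hi] in place using the incremental approach.
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def hybrid_merge_sort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(a, lo, hi)   # small subproblem: insertion sort is faster
        return
    mid = (lo + hi) // 2
    hybrid_merge_sort(a, lo, mid)
    hybrid_merge_sort(a, mid + 1, hi)
    # Merge the two sorted halves back into a[lo..hi].
    merged = []
    i, j = lo, mid + 1
    while i <= mid and j <= hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid + 1])
    merged.extend(a[j:hi + 1])
    a[lo:hi + 1] = merged

# Example: hybrid_merge_sort(xs) sorts the list xs in place.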
Even if we have a particularly efficient solution (on the space and time scale) for a problem, choosing
it straightaway can do more harm than good. Software engineering experience in building large
systems has shown that the difficulty of implementation and the quality and performance of the final result
also depend heavily on carefully choosing the best data structures. After the data structures are
chosen, the algorithm to be used often becomes obvious. However, sometimes it is the algorithm
that governs the choice of the data structure, as some algorithms work best with particular data
structures only. This point will become clearer when studying data structures for disjoint sets.
Applications
What makes the subject the most interesting discipline of the entire field of computer engineering (at least according to me) is the fact that there is no line drawn in any engineering discipline where you
can say that "from this point on, algorithms are of no use". Algorithms are encountered almost
everywhere there is any phenomenon related to computers and programming.
In computer operating systems, as early as the 1960s when the concept of the multitasking operating
system developed, problems of process synchronization and scheduling arose.
Solutions to the classical problems of synchronization, including the producer-consumer problem, the critical
section problem, the readers-writers problem, the dining philosophers problem, etc., have been carefully studied
and efficient algorithms devised to solve them. CPU and disk scheduling are two other
main aspects of time-sharing systems, which are achieved using efficient scheduling algorithms
like SJF, round robin, etc. The strategy for choosing algorithms becomes even more complex when we deal
with multiprocessor systems. We use parameters like throughput to compare algorithms and analytically evaluate them using several deterministic models. Some applications of graphs, such as the Banker's algorithm, are also used while handling deadlocks.
Talking about graphs: many graph problems such as covering and partitioning
problems, graph coloring, shortest paths, vertex ordering and problems of isomorphism
also seek algorithms for their solution, while graphs themselves have their own interesting history
of providing good solutions to some of the longest-standing problems of CS. The teleprinter's
problem, a long-standing problem of communication theory, was solved using an
application of Euler digraphs in 1940. Another, the Königsberg bridge problem, has a similar, older
history.
In computer networks, needless to say, the role of algorithms is indispensable. They play a
major role in providing solutions to many problems in computer network theory. Issues of
channel allocation, congestion control, routing (using shortest-path algorithms like Dijkstra and
Bellman-Ford), rendering maps and interpolating addresses are vital to every network design.
In the theory of compiler construction, error detection and handling, code optimization, symbol
table mapping, parsing and other design issues are answered by clever algorithms only.
Storing and retrieving the astronomical volumes of data stored on the internet in different
databases, and the tools for analyzing that data, require sophisticated algorithms.
Public key cryptography and the techniques of digital signatures in e-commerce are among the chief
technologies that make use of numerical algorithms and number theory.
Hardware design, in the electronics, manufacturing and IT industries, uses algorithms. The design of GUIs relies on algorithms. Computational biology makes use of algorithms, and so on.
Theoretically we see an algorithm as a tool for solving well-defined computational problems, but
the place of algorithms in modern computing is no less important than advanced technologies like
fast hardware with high clock rates, pipelining and superscalar architectures, interactive GUIs,
OOS, advanced LAN and WAN networking technologies, etc.
Although there are applications that do not seem to require algorithms at the application level,
they rely on them extensively under the hood. All these applications are written in one language
or another that finally translates into machine language. They are then processed by compilers,
interpreters or assemblers, all of which use algorithms at the core of their functioning.
There are, however, problems in theoretical computer science for which no efficient algorithms
are known. An important subset of these are known as NP-complete problems. We try to break
such problems into subproblems for which there is an efficient algorithm, or try to find
optimized solutions for a smaller class of inputs. We shall take this up later.
Prerequisites
As already said, an important prerequisite to studying and designing algorithms is a strong
mathematical background in various concepts. A good list (although it may not be exhaustive) is
prepared below.


Concepts from discrete mathematics: sets and their laws, relations and groups.
Functions: monotonicity, floors and ceilings, exponentials, logarithms, the iterated logarithm, factorials, modular
arithmetic functions, interpolation and approximations.
Polynomials and polynomial equations:
Polynomials: asymptotic behavior of polynomials, the DFT and the use of the FFT to reduce the time of
polynomial multiplication, the convolution theorem, the FFT using modular arithmetic.
Polynomial equations:
Single variable
1. Algebraic and transcendental equations:
direct methods - quadratic, cubic (Cardano's and Horner's methods) and
biquadratic equations.
indirect methods (approximation or iterative numerical methods,
generally for higher-degree transcendental equations) - Bisection, Iteration,
Regula-falsi, Secant, Newton-Raphson, Lin-Bairstow's, Muller's and Graeffe's root-squaring
methods.
Multiple variables (>= 2 variables), or systems of simultaneous algebraic equations
Here we have more than one equation, i.e. a system of simultaneous algebraic equations. We
usually represent such a system using matrices. In connection with matrices, two
numerical problems arise besides finding the solution of the equations by the methods briefed below:
the first is finding the inverse of a matrix, using Gauss elimination, Gauss-Jordan, Crout's method
or the partition method; the other is finding the eigenvalues and corresponding eigenvectors of
a matrix, where we come across methods like the Cayley-Hamilton theorem, the power method, Jacobi's
method, Givens' method and Householder's method. Computation of eigenvalues and of
simultaneous linear equations is required in many engineering and scientific problems.
1. Simultaneous non-linear equations
indirect methods - the method of iteration, complex roots by Newton-Raphson.
2. Simultaneous linear equations
direct methods - Cramer's rule, matrix inversion methods, Gauss
elimination, Gauss-Jordan, Crout's (LU) method.
iterative methods - Jacobi's method, Gauss-Seidel, relaxation, and ill-conditioned
systems of equations.
Series and Summations
Elementary number-theoretic notions: basic concepts of divisibility, modular equivalence,
unique factorization, GCD and Euclid's algorithm (see the small sketch after this list), Lamé's theorem, concepts of modular
arithmetic (finite groups, e.g. abelian groups, Lagrange's theorem, the Chinese remainder theorem),
integer factorization.
Combinatorics: permutations and combinations, binomial coefficients, Stirling's formula.
Probability: axioms of probability, probability distributions (discrete and continuous uniform
distributions), Bayes' theorem, Boole's inequality, discrete random variables and their
probability mass functions, the linearity of expectation, the variance and
standard deviation of random variables, geometric and binomial distributions.
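Since Euclid's algorithm comes up again later in these notes, here is a minimal Python sketch of it in iterative form; the function name is mine, chosen only for illustration.

# Euclid's algorithm: gcd(a, b) by repeated division with remainder.
def gcd(a, b):
    while b != 0:
        a, b = b, a % b   # the remainder step whose iteration count Lame's theorem bounds
    return a

# Example: gcd(252, 198) returns 18.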
Numerical analysis is the study of algorithms for the problems of continuous mathematics (as
distinguished from discrete mathematics). Numerical analysis naturally finds applications in all
fields of engineering and the physical sciences, but in the 21st century, the life sciences and even
the arts have adopted elements of scientific computations. Ordinary differential equations appear
in the movement of heavenly bodies (planets, stars and galaxies); optimization occurs in
portfolio management; numerical linear algebra is essential to quantitative psychology; stochastic
differential equations and Markov chains are essential in simulating living cells for medicine and
biology.
Analysis of Algorithms
We chiefly focus on studying and analyzing the asymptotic efficiency of algorithms, i.e. how the
running time of an algorithm increases with the size of the input in the limit, as the size of the
input increases without bound. For large inputs we choose algorithms that are asymptotically
more efficient. To study this, several asymptotic notations have been developed to make the
analysis easier. Aho, Hopcroft and Ullman also advocated the use of this kind of analysis for comparing the
relative performance of algorithms.
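For reference, the three most common asymptotic notations can be stated formally as below; this summary is my addition, following the standard textbook definitions.

\[
\begin{aligned}
O(g(n)) &= \{\, f(n) : \exists\, c > 0,\ n_0 > 0 \text{ such that } 0 \le f(n) \le c\,g(n) \text{ for all } n \ge n_0 \,\}\\
\Omega(g(n)) &= \{\, f(n) : \exists\, c > 0,\ n_0 > 0 \text{ such that } 0 \le c\,g(n) \le f(n) \text{ for all } n \ge n_0 \,\}\\
\Theta(g(n)) &= O(g(n)) \cap \Omega(g(n))
\end{aligned}
\]

In these terms, insertion sort's worst-case running time is Theta(n^2) while merge sort's is Theta(n lg n).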
Design strategies
Dynamic programming is suited to problems that show optimal substructure, and it avoids the
recomputation of sub-subproblems, unlike the divide and conquer technique, where the
subproblems are independent and are therefore computed recursively every time. The running
time of such algorithms is obtained in the form of recurrences (the value of a function in terms of
its values at smaller inputs) that are solved using various recurrence techniques. It is interesting to
see that recurrences were studied as early as 1202 by L. Fibonacci, for whom the Fibonacci
numbers are named. In dynamic programming, the recomputation is avoided by maintaining a
table of results, better known as memoization. Its great advantage is that it reduces the exponential
complexity of many problems to polynomial complexity, but unfortunately it is not a solution to
all complex problems. Assembly line scheduling, matrix chain multiplication and longest common
subsequence are some examples better understood in the context of dynamic programming. An
important and simpler variant of divide and conquer is decrease and conquer, used by the binary
search algorithm.
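As a tiny illustration of memoization (the function names are mine, not from the notes), compare a naive recursive Fibonacci, which recomputes the same subproblems exponentially often, with a memoized version that stores each result in a table.

# Naive recursion: exponential time, because fib_naive(n-1) and
# fib_naive(n-2) recompute the same subproblems over and over.
def fib_naive(n):
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

# Dynamic programming via memoization: each subproblem is solved once
# and its result stored in a table, giving linear time.
def fib_memo(n, table=None):
    if table is None:
        table = {}
    if n < 2:
        return n
    if n not in table:
        table[n] = fib_memo(n - 1, table) + fib_memo(n - 2, table)
    return table[n]

# Example: fib_memo(40) returns 102334155 almost instantly,
# while fib_naive(40) takes noticeably longer.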
Another approach, similar to dynamic programming, is the greedy approach, where we do not look
for an exact solution at each subproblem step; instead, a greedy choice is made of whatever looks
best for the moment. Greedy is not always the best solution, but when it works correctly, it is the
fastest. Kruskal's algorithm for finding an MST is perhaps the best-known example of it; others we
will study include activity selection, task scheduling, Huffman codes, etc.
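As a quick sketch of a greedy choice (my own illustration, not from the notes): in activity selection we repeatedly pick the compatible activity that finishes earliest.

# Greedy activity selection: always pick the compatible activity that
# finishes earliest. Activities are (start, finish) pairs.
def select_activities(activities):
    chosen = []
    last_finish = float('-inf')
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:       # compatible with what we already chose
            chosen.append((start, finish))
            last_finish = finish
    return chosen

# Example: select_activities([(1, 4), (3, 5), (0, 6), (5, 7), (8, 9)])
# returns [(1, 4), (5, 7), (8, 9)].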
Greedy and dynamic programming are in fact two independent methods for solving a higher class of problems
called optimization problems. These problems go through a series of steps with a set of choices
at each step. Optimization finds its application in the design of Huffman codes for data
compression, optimal solutions over matroids, MSTs, single-source shortest paths, TSP,
0-1 knapsack and Chvátal's set-covering heuristic, to name a few.
Avid readers would ask how it is possible to know whether a problem is best
solved by a greedy approach or by dynamic programming. It is of course not known in advance, so we first use the dynamic programming
approach to find a solution and then show that we can always make greedy choices
to arrive at an optimal solution.
We also study other design and analysis methods, such as amortized analysis, where the cost of a sequence of operations
is averaged using methods such as the accounting and potential methods. Linear programming is for
problems like maximum flow, and integer programming, a more complex variant of linear
programming, is where the solution space is restricted to the integers. Linear programming and
randomization are useful techniques, besides greedy, in the design of approximation
algorithms; e.g. the 3-CNF satisfiability problem uses randomization to get an approximate solution,
and the vertex cover problem uses linear programming for the same. TSP, set covering and subset-sum
are other problems that exploit either approximation or greedy techniques. Some of them are NP-complete
problems which, as we shall see, leave only one practical route: to find a near-optimal solution using an
approximation algorithm. Yet other problems use reduction, or transform and conquer, where
the goal is to transform a problem into another, simpler one such that the cost of the transformation does not
dominate the cost of solving the transformed problem; search and enumeration; backtracking; and hill climbing, where
the basic idea is to start with a poor solution to a problem and then repeatedly apply
optimizations to that solution until it becomes optimal or meets some other requirement. Hill climbing is
used in network flow problems.
Probabilistic analysis is another powerful way of analyzing algorithms: it assumes random behavior in the distribution of inputs, while algorithms that themselves make random choices are called
randomized algorithms; the hiring problem and the birthday paradox are the classic examples used to introduce both. Probability is also used in analyzing algorithms, e.g. in the average-case analysis of quicksort, bucket sort and
the order statistic algorithm; however, probabilistic analysis puts several restrictions
on the input distribution, and that is not always easy to justify. Randomization and probabilistic analysis have become
fundamental tools in modern computer science, with applications ranging from combinatorial
optimization to machine learning, cryptography, complexity theory and the design of
protocols for communication networks. Often randomized algorithms are more efficient, and
conceptually simpler and more elegant, than their deterministic counterparts. Probabilistic and
heuristic approaches are by far my most favorite design strategy and will be taken up in somewhat
more detail later.
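As a small illustration of a randomized algorithm (my own sketch, not from the notes): randomizing the pivot choice in quicksort makes its expected running time O(n lg n) for every input, since no particular input ordering can force the worst case.

import random

# Randomized quicksort: the pivot is chosen uniformly at random from the
# current list, so the expected running time is O(n lg n) on any input.
def randomized_quicksort(a):
    if len(a) <= 1:
        return a
    pivot = random.choice(a)
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return randomized_quicksort(less) + equal + randomized_quicksort(greater)

# Example: randomized_quicksort([3, 1, 4, 1, 5, 9, 2, 6]) returns [1, 1, 2, 3, 4, 5, 6, 9].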
Yet other, simpler strategies include recursion and iteration, which are often taken to be the same but are not
exactly so. However, every recursive version has an equivalent iterative version and vice versa. The
knowledge of which implementation, recursion or iteration, is to be used depends entirely
on the problem; e.g. the Tower of Hanoi is most naturally analyzed using recursion.
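For instance, the standard recursive solution of the Tower of Hanoi (sketched below with peg labels of my own choosing) moves n-1 disks out of the way, moves the largest disk, then moves the n-1 disks back on top, giving the recurrence T(n) = 2T(n-1) + 1, i.e. 2^n - 1 moves.

# Recursive Tower of Hanoi: move n disks from peg 'source' to peg 'target'
# using 'spare' as the auxiliary peg. Total moves: 2**n - 1.
def hanoi(n, source='A', target='C', spare='B'):
    if n == 0:
        return
    hanoi(n - 1, source, spare, target)   # clear the top n-1 disks out of the way
    print(f"move disk {n} from {source} to {target}")
    hanoi(n - 1, spare, target, source)   # put the n-1 disks back on top

# Example: hanoi(3) prints the 7 moves that solve the 3-disk puzzle.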
Besides straightforward algorithm design strategies like divide and conquer and dynamic programming,
we also often use data structures as a tool to devise efficient algorithms. Among many, the
heapsort algorithm is a perfect example of this: it uses the heap property. This is because, most
often, the algorithms we design tend to require operations that some data structure supports. We
then tend to use such a data structure directly in the algorithm and implement the operations of
that data structure as 'sub-steps' of the algorithm. For example, when an algorithm uses a data
structure, say a binary heap, it inherently uses one, two or all of the operations (such
as insertion, deletion or search) of the underlying data structure in its steps. The total time then
comes out to be the time taken by the algorithm plus the time taken by these operations. This brings out the
indispensability of asymptotically analyzing all the operations of the data structures, besides
analyzing the algorithms themselves.
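A minimal sketch of this idea, using Python's standard heapq module (this particular min-heap version is my own illustration): heapsort is essentially repeated use of the heap's build and extract-min operations, so its O(n lg n) bound follows directly from the O(lg n) cost of each heap operation.

import heapq

# Heapsort built from heap operations: build a min-heap (O(n)), then
# extract the minimum n times (O(lg n) each), for O(n lg n) overall.
def heapsort(items):
    heap = list(items)
    heapq.heapify(heap)                                      # heap operation: build-heap
    return [heapq.heappop(heap) for _ in range(len(heap))]   # heap operation: extract-min

# Example: heapsort([5, 3, 8, 1]) returns [1, 3, 5, 8].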
Consider another example. There are several operations that we perform on dynamic sets, e.g. insertion, deletion, testing
membership of an element, successor, predecessor, etc. The question of which data structure we should
use to represent these sets depends on which 'set of operations' we need. We may use data
structures such as stacks, queues, linked lists and rooted trees (e.g. BSTs and their variants) to represent
dynamic sets, but when we need only to implement the dictionary operations (insert, delete and
search), we use hash tables to represent dynamic sets, as hashing works out to much better
asymptotic efficiency.
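As a concrete (and admittedly trivial) Python sketch: the built-in dict is a hash table, so the three dictionary operations run in expected O(1) time each, versus O(n) search in an unsorted linked list.

# Dictionary operations on a hash table (Python's dict):
table = {}
table['alice'] = 42        # insert
found = 'alice' in table   # search: True
del table['alice']         # delete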
In many other cases it has been shown that significant improvements in efficiency can be
obtained by selecting (or designing) appropriate representations of the data structures employed
and of the operations that manipulate them, without requiring any changes to the underlying
algorithms.
Another important use of data structures is in the implementation of disjoint sets, as some
applications involve grouping n distinct elements into a collection of disjoint sets. We implement
these disjoint sets as a data structure. One of the many applications of the disjoint-set data structure
arises in determining the connected components of an undirected graph.
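Here is a minimal union-find sketch (my own, with path halving and union by size) showing how the connected components of an undirected graph can be read off from its edge list.

# Disjoint-set (union-find) with path halving and union by size,
# used here to group the vertices of an undirected graph into components.
def connected_components(num_vertices, edges):
    parent = list(range(num_vertices))
    size = [1] * num_vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving (a form of path compression)
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return
        if size[rx] < size[ry]:
            rx, ry = ry, rx
        parent[ry] = rx                     # union by size
        size[rx] += size[ry]

    for u, v in edges:
        union(u, v)
    return [find(v) for v in range(num_vertices)]

# Example: connected_components(5, [(0, 1), (3, 4)]) labels vertices
# 0 and 1 with one representative, 3 and 4 with another, and 2 alone.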
Classification
From one point of view, algorithms can be either deterministic or non-deterministic, depending
on whether or not they make use of clever choices or heuristics. The latter definitely use
heuristics and other techniques for making typical and accurate guesses.
But actually, computer scientists classify problems (not algorithms) into equivalence classes
based on the complexity of the best possible algorithm for them.
Computational complexity theory, as a branch of the theory of computation in computer science,
investigates the amounts of resources required for the execution of
algorithms (e.g., execution time), and the inherent difficulty of providing efficient algorithms for
specific computational problems. Much of complexity theory deals with (1) decision problems.
A decision problem is a problem where the answer is always yes or no. Complexity theory
distinguishes between problems with yes answers and problems with no answers. A problem that
reverses the yes and no answers of another problem is called the complement of that problem.
The theory of computational complexity sorts decision problems into three main
complexity classes: P (polynomial), NP (non-deterministic polynomial) and NPC (NP-complete).
A complexity class is the set of all of the computational problems which can be solved using a
certain amount of a certain computational resource.
Class P, the tractable or easy problems, is the set of problems that are solvable in polynomial
time by a deterministic machine, i.e. on inputs of size n their worst-case running time is O(n^k)
for some constant k.
The complexity class NP is the set of decision problems that can be solved by a non-deterministic
machine in polynomial time. The class includes the Boolean satisfiability problem, the
Hamiltonian path problem, the vertex cover problem, etc.
Class NPC, a subset of NP, is the most interesting set of decision problems, because no
polynomial-time algorithm is known for them, nor have we been able to prove whether such
an algorithm exists or not. But these problems have two interesting properties. One: if any single
NP-complete problem can be solved in polynomial time, then every problem in NP can also be
solved in polynomial time. So if we ever find a polynomial solution to any NPC problem, then, due
to this first property, every problem in NP can be quickly solved. We would therefore say that it is not
known whether every problem in NP can be quickly solved -- this is called the P = NP problem.
Nobody has yet been able to determine conclusively whether NP-complete problems are in fact
solvable in polynomial time, making this one of the great unsolved problems of mathematics.
The Clay Mathematics Institute is offering a $1 million reward to anyone who has a formal proof
that P = NP or that P ≠ NP. The second property is that their solutions can be verified in polynomial time.
But from where does this solution, which we are verifying, come? To verify the solution, we are
somehow given a 'certificate' of the solution and then prove that the certificate is correct in time
polynomial in the input size. NP-complete problems are studied because the ability to
quickly (in polynomial time) verify 'solutions' to a problem using certificates seems to correlate
with the ability to quickly solve that problem itself.
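As a tiny sketch of certificate verification (my own illustration): given a graph and a proposed vertex cover as a certificate, checking the certificate takes only polynomial time, even though finding a minimum vertex cover is believed to be hard.

# Polynomial-time verification of a vertex-cover certificate:
# every edge must have at least one endpoint in the certificate,
# and the certificate must contain at most k vertices.
def verify_vertex_cover(edges, certificate, k):
    cover = set(certificate)
    if len(cover) > k:
        return False
    return all(u in cover or v in cover for u, v in edges)

# Example: verify_vertex_cover([(0, 1), (1, 2), (2, 3)], [1, 2], 2) returns True.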
There is another class of problems, called (2) intractable (Hopcroft et al., 2007: 368) or hard
problems, which are those that require superpolynomial time. To see why exponential-time solutions
might be unusable in practice, consider a problem that requires 2^n operations to solve (where n is the
size of the input). For a relatively small input size of n = 100, and assuming a computer that can
perform 10^12 operations per second, a solution would take about 4×10^10 years, much longer
than the current age of the universe. Complexity theory assumes that problems lacking polynomial-time solutions are intractable for all but the smallest inputs. Problems that are known to be
intractable in this sense include those that are EXPTIME-complete. If NP is not the same as P, then the NP-complete problems are also intractable in this sense. What this means "in practice" is open to debate. Then there are (3) problems that can't be solved by any computer, no matter how much time is provided, such as Turing's famous halting problem.
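The arithmetic behind that estimate, just as a check, works out as follows.

\[
2^{100} \approx 1.27 \times 10^{30} \text{ operations}, \qquad
\frac{1.27 \times 10^{30}}{10^{12}\ \text{ops/s}} \approx 1.27 \times 10^{18}\ \text{s}
\approx \frac{1.27 \times 10^{18}}{3.15 \times 10^{7}\ \text{s/year}} \approx 4 \times 10^{10}\ \text{years}.
\]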
Decision problems might seem to stand opposite to optimization problems, in which we always try to
find the best (optimal) solution rather than quickly saying 'yes' or 'no' to any particular
solution. In this sense, NP-completeness applies directly to decision problems but not to
optimization problems.
I would love to fill this entire manual with the theory of NP-completeness, but that's not feasible,
so more details and lemmas will come later.
Implementation of algorithms
Algorithms are essentially implemented in one language or another (an HLL, e.g. Perl, Java, C++,
or an LLL) and are designed to function in serial, parallel or distributed environments.
Algorithms have to incur the additional cost of communication overhead if they are used in
distributed environments, while algorithms designed for serial environments execute only
one instruction at a time. Parallel models have the benefit that several processors work on
the same problem at the same time. Parallel and distributed techniques divide a problem into
more or less symmetric subproblems and then combine the results.
There are 3 major types of machines, depending on their representation of data in memory:
· Pointer machines – these machines represent data using pointers only; no arrays are defined
on these machines.
· RAM model – these machines represent data as arrays. In all the lectures, it is assumed
that we are working with these machines.
· PRAM model – this model is used in parallel computing and is similar to the RAM
model.
The algorithms given here are all based on the RAM model of computation, which is covered as a
separate study.
After all this talking, it is needless to say now that algorithms are an essential part of EVERY
computer system (the capitalization does matter), and so is the need for EVERY CS student
(the capitalization does matter again) to master their concepts.
This is important because, if we ever think of any innovation, we should not be held back by the
fundamentals that really begin it.
Hope you've enjoyed talking with me, and enjoy developing your concepts too.
Thanks!
Amit Nigam
er.nigamark@zapak.com
