Here is a nice blog by Eric Vallabh Minikel explaining how PCR duplicates arise during Illumina sequencing.
To summarise, an early step in Illumina sequencing is to PCR-amplify fragments that have adaptors ligated to each end, which amplifies your DNA about 64-fold. The next step is to spread the DNA solution across the flow cell, with the aim that each DNA molecule attaches at its own position on the flow cell's lawn of primers.
[Note: the DNA molecules attach at random positions to the inside surface of the flow cell, which is covered with a dense lawn of primers. Tens of millions of DNA molecules will attach to the flow cell surface, and each will form one 'cluster' when bridge PCR occurs.]
However, sometimes two copies of the same original molecule (say, 2 of the 64 copies you made of each molecule) attach at different positions on the flow cell, so you end up reading the same DNA in two different 'clusters' [each 'cluster' having about 1 million copies of the original fragment, produced by bridge PCR in a tiny region of the flow cell]. These are your PCR duplicates.
In a seqanswers.com discussion, Li Heng (lh3) says that the rate of PCR duplicates is 0.5*m/N, where m is the number of sequenced reads and N is the number of DNA molecules before amplification. He says that the key to reducing PCR duplicates is to start with enough DNA (large N). Conversely, the more reads you sequence (higher m), the larger the fraction of PCR duplicates you will get.
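To get a feel for the 0.5*m/N formula, here is a minimal Python sketch. The function names are made up for illustration. The second function is my own assumption and not something from the discussion: it supposes reads are drawn uniformly at random from N equally amplified molecules, in which case the duplicate fraction is 1 - (1 - exp(-m/N))/(m/N), which reduces to roughly 0.5*m/N when m is much smaller than N.

```python
import math


def approx_duplicate_rate(m, n):
    """Approximation quoted from the discussion: 0.5 * m / N.

    m: number of sequenced reads
    n: number of DNA molecules before amplification
    """
    return 0.5 * m / n


def sampling_duplicate_rate(m, n):
    """Duplicate fraction under a simple random-sampling model.

    ASSUMPTION (not from the discussion): every one of the n molecules
    is amplified equally and reads are sampled uniformly at random.
    The expected number of distinct molecules seen is then
    n * (1 - exp(-m/n)), and every other read is a duplicate.
    """
    x = m / n
    # -expm1(-x) equals 1 - exp(-x), computed accurately for small x
    return 1.0 - (-math.expm1(-x)) / x


if __name__ == "__main__":
    library_size = 100_000_000  # N, a hypothetical library
    for reads in (1_000_000, 10_000_000, 50_000_000):
        print(reads,
              approx_duplicate_rate(reads, library_size),
              sampling_duplicate_rate(reads, library_size))
```

For example, 1 million reads from a library of 100 million molecules gives 0.5 * 1e6 / 1e8 = 0.005, so about 0.5% of reads would be expected to be PCR duplicates. Sequencing ten times as many reads from the same library raises the approximate rate tenfold.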
In the same seqanswers.com discussion, Li Heng (lh3) says that optical duplicates are sequences from a single flow cell cluster that the software incorrectly identifies as coming from multiple adjacent clusters.
Identifying PCR duplicates and optical duplicates
In a seqanswers.com discussion, Li Heng (lh3) says that PCR duplicates are usually identified after alignment, e.g. by identifying read-pairs that have identical 5'-end coordinates.
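As a rough illustration of the "identical 5'-end coordinates" idea, here is a small Python sketch. It is not how any particular duplicate-marking program is implemented. The `ReadPair` record and its field names are hypothetical, and in practice you would fill them in from your alignment file. The sketch simply keeps the first pair seen at each set of coordinates and flags the rest, whereas real tools are more careful about which copy to keep.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical, simplified record for one aligned read-pair."""
    name: str
    chrom: str
    pos1: int      # 5'-end coordinate of read 1
    strand1: str   # '+' or '-'
    pos2: int      # 5'-end coordinate of read 2
    strand2: str


def find_pcr_duplicates(pairs):
    """Return the names of read-pairs flagged as duplicates.

    Pairs are grouped by chromosome, 5'-end coordinates and strands
    of both reads. The first pair in each group is kept and all the
    others are flagged (a simplification for illustration).
    """
    groups = defaultdict(list)
    for p in pairs:
        key = (p.chrom, p.pos1, p.strand1, p.pos2, p.strand2)
        groups[key].append(p.name)

    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])  # everything after the first copy
    return duplicates


if __name__ == "__main__":
    example = [
        ReadPair("pairA", "chr1", 1000, "+", 1350, "-"),
        ReadPair("pairB", "chr1", 1000, "+", 1350, "-"),  # same 5' ends as pairA
        ReadPair("pairC", "chr1", 1000, "+", 1420, "-"),  # different mate position
    ]
    print(find_pcr_duplicates(example))
```

In this toy example only pairB is flagged, because pairC has a different 5'-end coordinate for its second read and so probably comes from a different original fragment.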
Li Heng says that optical duplicates can be identified by checking the sequence and the coordinates on the image, and that alignment is not needed to identify them.
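The sequence-plus-image-coordinates check could look something like the Python sketch below. Everything here is an illustrative assumption: the `Cluster` record is hypothetical (its lane, tile, x and y fields would have to come from wherever your data stores cluster positions), requiring exactly identical sequences is a simplification, and the default distance of 100 is an arbitrary threshold rather than a recommendation.

```python
from collections import defaultdict
from typing import NamedTuple


class Cluster(NamedTuple):
    """Hypothetical record for one read and its position on the image."""
    name: str
    sequence: str
    lane: int
    tile: int
    x: int
    y: int


def find_optical_duplicates(clusters, max_distance=100):
    """Return the names of reads flagged as optical duplicates.

    A read is flagged if an earlier read with the identical sequence,
    on the same lane and tile, lies within max_distance in both x and y.
    No alignment is involved. max_distance is an arbitrary assumption.
    """
    by_key = defaultdict(list)
    for c in clusters:
        by_key[(c.sequence, c.lane, c.tile)].append(c)

    duplicates = set()
    for group in by_key.values():
        kept = []  # reads accepted so far as distinct clusters
        for c in group:
            is_close = any(
                abs(c.x - k.x) <= max_distance and abs(c.y - k.y) <= max_distance
                for k in kept
            )
            if is_close:
                duplicates.add(c.name)
            else:
                kept.append(c)
    return duplicates


if __name__ == "__main__":
    example = [
        Cluster("r1", "ACGTACGT", 1, 1101, 1000, 2000),
        Cluster("r2", "ACGTACGT", 1, 1101, 1010, 2005),  # right next to r1
        Cluster("r3", "ACGTACGT", 1, 1101, 9000, 9000),  # same tile, far away
    ]
    print(find_optical_duplicates(example))
```

Here r2 is flagged because it has the same sequence as r1 and sits almost on top of it on the same tile, whereas r3 is far enough away that it is more likely a PCR duplicate (or a genuinely independent fragment) than an optical one.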
Should we mark (and remove) duplicates from the analysis?
Li Heng says that marking (and removing) duplicates is a good idea for SNP calling, because you generally have high-coverage data. However, he says it is dangerous to mark (and remove) duplicates for RNA-seq or ChIP-Seq, where read counts matter. He says it would be better to account for duplicates in your read-counting model than to run a duplicate-marking program.
Thanks to Bhavana Harsha for the link to Eric Vallabh Minikel's blog.