DATA NORMALIZATION
Along with library size and gene length, average genome size (AGS) is a source of bias when conducting gene-centric comparative metagenomic studies. This is described nicely in a 2010 ISME Journal paper by Beszteri et al. Specifically, a gene that is present at the same copy number per genome will be sequenced more often from a community with a small AGS than from a community with a large AGS. For example, essential single-copy genes will appear more abundant in communities with small genomes.
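The size of this bias can be illustrated with a rough sketch. It assumes reads are sampled uniformly along genomes and ignores read-length edge effects, so the expected number of reads from a gene is proportional to its share of the average genome. The function name and the numbers (2 Mb and 4 Mb communities) are hypothetical and chosen only for illustration:

```python
def expected_gene_reads(reads_sequenced, gene_length_bp, ags_bp, copies_per_genome=1):
    """Rough expected read count for a gene.

    Simplifying assumption: reads are drawn uniformly along genomes, so the
    gene captures a fraction (copies * gene length / AGS) of all reads.
    """
    return reads_sequenced * copies_per_genome * gene_length_bp / ags_bp


# Hypothetical single-copy 1 kb gene, 1 million reads per library.
small_ags_reads = expected_gene_reads(1_000_000, 1_000, 2_000_000)
large_ags_reads = expected_gene_reads(1_000_000, 1_000, 4_000_000)

print(small_ags_reads, large_ags_reads)
```

Under these assumptions, the same single-copy gene draws twice as many reads from the community with half the AGS, even though its per-genome copy number is identical.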

To correct for this bias, along with library size and gene length, we propose the statistic RPKG (reads per kb per genome). This is similar to the commonly used statistic RPKM (reads per kb per million mapped reads), but instead of dividing by the number of mapped reads in millions, we divide by the expected number of genomes sequenced:

RPKG = (reads mapped to gene)/(gene length in kb)/(genomes sequenced)
   and
genomes sequenced = (total DNA sequenced in bp)/(average genome size in bp)
   and
total DNA sequenced in bp = (read length in bp) * (reads sequenced)
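The three formulas above can be sketched as a small Python helper. This is a minimal illustration, not part of MicrobeCensus itself; the function and parameter names are assumptions, and the input values in the usage lines are hypothetical:

```python
def genomes_sequenced(read_length_bp, reads_sequenced, ags_bp):
    """Expected number of genomes sequenced in a library."""
    if ags_bp <= 0:
        raise ValueError("average genome size must be positive")
    total_dna_bp = read_length_bp * reads_sequenced
    return total_dna_bp / ags_bp


def rpkg(mapped_reads, gene_length_bp, read_length_bp, reads_sequenced, ags_bp):
    """Reads per kb of gene per genome sequenced."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    gene_length_kb = gene_length_bp / 1000
    genomes = genomes_sequenced(read_length_bp, reads_sequenced, ags_bp)
    if genomes <= 0:
        raise ValueError("library must contain sequenced DNA")
    return mapped_reads / gene_length_kb / genomes


# Hypothetical inputs: 50 reads on a 2 kb gene, 2 million 150-bp reads, 3 Mb AGS.
example_value = rpkg(50, 2_000, 150, 2_000_000, 3_000_000)
print(example_value)
```

Keeping the genome count in its own function mirrors the way the formula is broken down above and makes the intermediate value easy to inspect.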


USE CASE
We have two metagenomic libraries, L1 and L2, which contain genomic DNA from two different microbial communities. 
Each library contains 1 million 100-bp reads: 

READ_LENGTH_L1 = 100 bp
READS_SEQUENCED_L1 = 1,000,000 
TOTAL_DNA_L1 = 100,000,000 bp
READ_LENGTH_L2 = 100 bp
READS_SEQUENCED_L2 = 1,000,000 
TOTAL_DNA_L2 = 100,000,000 bp

We use MicrobeCensus to estimate the average genome size of each library: 

AGS_L1 = 2,500,000 bp
AGS_L2 = 5,000,000 bp

Next, we map reads from each library to a reference database that contains a gene of interest, G, which is 1,000 bp long.
We get 100 reads mapped to gene G from each library:

LENGTH_G = 1,000 bp
MAPPED_READS_G_L1 = 100
MAPPED_READS_G_L2 = 100

Finally, we quantify RPKG for gene G in each library:

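RPKG for G in L1 = (100 mapped reads)/(1 kb)/(100,000,000 bp sequenced / 2,500,000 bp AGS) = 2.5
RPKG for G in L2 = (100 mapped reads)/(1 kb)/(100,000,000 bp sequenced / 5,000,000 bp AGS) = 5.0

Although the same number of reads mapped to G in both libraries, L1 contains twice as many sequenced genomes (40 versus 20), so G is twice as abundant per genome in L2.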
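The arithmetic in this use case can be checked with a short, self-contained Python sketch. The dictionary layout and variable names are illustrative assumptions; the numbers are the ones given above:

```python
GENE_LENGTH_BP = 1_000  # length of gene G

# Library values taken from the use case above.
libraries = {
    "L1": {"read_length": 100, "reads": 1_000_000, "ags": 2_500_000, "mapped": 100},
    "L2": {"read_length": 100, "reads": 1_000_000, "ags": 5_000_000, "mapped": 100},
}

rpkg_by_library = {}
for name, lib in libraries.items():
    total_dna_bp = lib["read_length"] * lib["reads"]   # total DNA sequenced
    genomes = total_dna_bp / lib["ags"]                # genomes sequenced
    gene_length_kb = GENE_LENGTH_BP / 1000
    rpkg_by_library[name] = lib["mapped"] / gene_length_kb / genomes
    print(f"{name}: genomes sequenced = {genomes}, RPKG = {rpkg_by_library[name]}")
```

Running this reports 40 genomes sequenced for L1 and 20 for L2, giving RPKG values of 2.5 and 5.0 respectively.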
              