LHAPDF6 design

Introduction

There are several reasons for this rewrite and the migration to C++, the main ones being:

Hugely reduced memory overhead
- Although modern Fortran compilers can now dynamically allocate memory, LHAPDF was written with F77 and static memory allocation in mind. The memory overhead of LHAPDF5 is therefore proportional to the number of supported PDF sets, then multiplied again by the number of simultaneous PDF sets that can be used. The uninitialized static memory footprint is hence huge: at more than 2 GB it is incompatible with running LHAPDF on the Grid in "full memory" mode. This excludes use of standard LHAPDF builds for PDF reweighting and other tasks on the standard LHC computing resources. Dynamic memory allocation in LHAPDF6 entirely solves this problem, while removing all restrictions on the number of simultaneous PDF sets.
Speed
- Early tests show that LHAPDF6 can be substantially faster than LHAPDF5 in event generation, largely due to the ability to now evolve each PDF flavour independently.
Encapsulation
- The multi-set features of LHAPDF5 were added in retrospect, after the writing of PDF set wrapper codes which assumed only one PDF set would be used at one time. As a result, the parallel use of multiple sets returns correct PDF values, but does not work correctly for "metadata" such as alpha_s, number of flavours, etc. The new version stores such information in Info and AlphaS objects which are strictly bound to the PDF being accessed, so these problems of global state are no longer an issue.
Generality
- Previous versions of LHAPDF evolved slowly over time, and the original code was not written with certain later features in mind. Taking the above case of multiple PDF loading as an example, the current Fortran version of LHAPDF has specific functionality to deal with this case. The C++ version that we propose deals with multiple PDF loading as standard, using the same functionality as for the single PDF case. Other examples include dealing with special parton cases such as photons. Although this has little to do with language limitations, a rewrite with these general cases in mind, in any language, will allow for clearer code. Along with this, the current Fortran code has been modified extensively to deal with different PDF file formats, each with their own parameter space and interpolation rules. One of the largest improvements implemented in this C++ version is a unified PDF file format, with general parton flavour content (using the PDG MC particle ID scheme). This removes the need for special wrapper codes for each PDF, avoids the need for special functions to access e.g. photon PDFs, and means that new PDF sets can be made available without requiring a new LHAPDF release.
Extensibility
- Extensions to the current Fortran code are difficult considering how much it has evolved since its conception. By exploiting object-oriented practices, the C++ version will allow for the easy extension of several features. An example of this is the Interpolator interface. By implementing it, users can add to the current list of general interpolation methods that can be called on any PDF set, rather than having specific interpolation rules included in the individual authors' wrapper files. The modular nature of the program means that unforeseen future requirements can, hopefully, be implemented easily.

Terminology

The core element of the API is the PDF interface, representing a single set member for several flavours.

Note that the historical/community terminology for levels of "PDF" is rather vague, leading to frequent confusion. The term "PDF" may, depending on the speaker, mean any of the following:

a single parton density function for one parton flavour;
a PDF set member, consisting of a correlated group of parton densities, one for each parton flavour;
a set of PDF members, usually reflecting uncertainties encountered in PDF construction.

The second of these, the "member", is the most commonly used PDF object as it is usual to choose a point in (x,Q2) space and then use the parton density values for each parton flavour to determine the outcome of a Markovian process step.

However, the name "member" is not in active usage for this concept. Hence in this version of LHAPDF we use the term "PDF" to mean a member, and "PDF set" to mean a collection of these. No explicit term is defined for a single-flavour PDF but we tentatively reserve the terms 1FPDF and PDF1F (the latter for C/C++ validity) for this purpose.

The main PDF object hierarchy

PDF objects are allocated dynamically (either locally or on the heap), so none of the static memory issues of Fortran LHAPDF arise. There is potential for use of singleton allocation based on unique data paths.

PDF data to be separated from the framework: new PDFs should not require addition of new handling code, nor even a new release of LHAPDF.

Data versioning is hence needed, because a library version does not imply a particular bug-fix version of a PDF set's data files. In fact this is already an unaddressed issue for LHAPDF5. We cannot automatically enforce good data provenance, but metadata flags are available for use in PDF sets which declare both the integer version of the set data and the integer-encoded version of the LHAPDF library required to use it. Set authors are encouraged to use both.

Flavours are identified by standard PDG ID codes. A PDF can contain arbitrarily many flavours. A special case is ID = 0, which for backward compatibility and convenience is treated as equivalent to ID = 21 (this allows a for-loop from -6 to +6 to cover all quarks and the gluon without need of special treatment to replace the 0 index with 21).

PDFSet objects are singletons used only for set-level metadata querying and for convenience loading of all PDFs in a set. They inherit from the general Info class used for metadata handling.

A "return zero" treatment for unsupported flavours, based on global, set-level, or member-level configuration, is used by default. It is also possible to request that an exception be thrown on attempting to access an undefined flavour. Note that this will include top quarks in almost all PDFs, as they are not explicitly tabulated, hence the default behaviour of assuming that an unlisted flavour has zero PDF density.
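The flavour handling described above (ID 0 as an alias for the gluon, and the choice between "return zero" and throwing for unlisted flavours) can be sketched as follows. This is a minimal illustration only: ToyPDFPoint, its members, and sumAllFlavors are hypothetical names, not the LHAPDF API, and the configuration cascade is reduced to a single boolean.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical toy model of one PDF evaluated at a fixed (x, Q2) point.
struct ToyPDFPoint {
  std::map<int, double> xf;     // xf values keyed by PDG ID code
  bool throwOnMissing = false;  // false selects the "return zero" default

  // ID 0 is treated as an alias for the gluon, PDG ID 21.
  static int canonicalId(int id) { return id == 0 ? 21 : id; }

  double value(int id) const {
    const auto it = xf.find(canonicalId(id));
    if (it != xf.end()) return it->second;
    if (throwOnMissing)
      throw std::out_of_range("Undefined flavour ID " + std::to_string(id));
    return 0.0;  // unlisted flavour: assume zero density
  }
};

// A loop from -6 to +6 covers all quarks and the gluon with no special case.
inline double sumAllFlavors(const ToyPDFPoint& p) {
  double total = 0.0;
  for (int id = -6; id <= 6; ++id) total += p.value(id);
  return total;
}
```

With throwOnMissing left at false, a request for the top quark (ID 6) in a five-flavour toy point simply contributes zero to the sum.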

The main PDF type is the "grid" PDF, based on interpolation of rectangular grids of data points.

Subgrids are an important feature: these are distinct PDF interpolation grids binned in Q2, so that gradient discontinuities (and value discontinuities, for NNLO PDFs) across flavour thresholds can be handled correctly. This is at least needed by the MSTW PDFs.
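As a rough sketch of how a query point might be routed to its subgrid, the following assumes that consecutive subgrids share a boundary at each flavour threshold and that a point exactly on a threshold belongs to the upper subgrid. Both the ToySubgrid type and that tie-breaking rule are assumptions of this example, not statements about the library.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical description of the Q2 range covered by one subgrid.
struct ToySubgrid {
  double q2Min;
  double q2Max;
};

// Return the index of the subgrid containing q2. Subgrids are assumed to be
// sorted in Q2, with each q2Max equal to the next subgrid's q2Min.
inline std::size_t findSubgrid(const std::vector<ToySubgrid>& grids, double q2) {
  assert(!grids.empty());
  if (q2 < grids.front().q2Min || q2 > grids.back().q2Max)
    throw std::out_of_range("Q2 outside all subgrids");
  // Search from the top so that a threshold value selects the upper subgrid.
  for (std::size_t i = grids.size(); i-- > 0;) {
    if (q2 >= grids[i].q2Min) return i;
  }
  return 0;  // not reached, given the range check above
}
```

Because each subgrid is interpolated on its own, nothing forces the values or gradients on the two sides of a threshold to agree, which is the behaviour the discontinuous sets require.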

Note: Subgrids in x have also been suggested, but just ensuring sufficient x knot density seems preferable: unlike across flavour thresholds in Q2, continuity in x is important, subgrids in x are complex to implement, and in any case the benefit is more for evolution than for interpolation.

A PDFSet object is provided for handling of set-level metadata and to provide a convenient interface for loading all members in a set.

Interpolators

The default interpolator is log-bicubic, in general using Q2 subgrids (if present) to handle discontinuous transitions in Q2 across flavour mass thresholds. Log interpolation? Named interpolators are overridable in code, but with defaults specified in the set/member metadata.

Caching: interpolators should be bound to a PDF object (and be unique, i.e. no singletons) so that they can cache looked-up/interpolated values. The main use-case for this is flavour caching, where all flavours in a PDF will be needed at a single (x,Q2) point. Since each flavour grid needs to be separately interpolated, ALWAYS calculating all flavours would be wasteful, so the indices of the surrounding grid points (including implicit identification of grid edges) may be cached so that the lookup need not be done for each flavour as they are all defined on the same (sub)grid: this can be done in the interpolator base class or calling code. The interpolation weights can also be cached, but this is specific to the interpolator algorithm.
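The knot-index caching idea can be illustrated with a small sketch. Everything here (ToyKnotCache, its fields, the lookups counter) is hypothetical and exists only to show that repeated queries at the same (x,Q2) point, one per flavour, need only one knot search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cache of the lower knot indices for the last (x, Q2) query.
struct ToyKnotCache {
  std::vector<double> xs, q2s;  // sorted knot positions shared by all flavours
  double lastX = -1.0, lastQ2 = -1.0;
  std::size_t ix = 0, iq2 = 0;
  int lookups = 0;              // counts real searches, for illustration only

  // Index i with knots[i] <= v < knots[i+1]; the top edge maps to the last interval.
  static std::size_t lowerIndex(const std::vector<double>& knots, double v) {
    assert(knots.size() >= 2);
    assert(v >= knots.front() && v <= knots.back());
    const auto it = std::upper_bound(knots.begin(), knots.end(), v);
    std::size_t i = static_cast<std::size_t>(it - knots.begin());
    if (i == knots.size()) i = knots.size() - 1;  // v sits on the top edge
    return i - 1;
  }

  // Locate the grid cell, reusing the cached indices on a repeated point.
  void locate(double x, double q2) {
    if (x == lastX && q2 == lastQ2) return;  // cache hit: no search needed
    ix = lowerIndex(xs, x);
    iq2 = lowerIndex(q2s, q2);
    lastX = x;
    lastQ2 = q2;
    ++lookups;
  }
};
```

A caller looping over eleven flavours at one point would call locate eleven times but trigger only one search. Caching the interpolation weights would follow the same pattern, but their form depends on the interpolation algorithm.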

Extrapolators

Extrapolators: extrapolation may be required but is not advised. Essentially a damage minimisation exercise. Specification via metadata or explicit code as for interpolators.

The default behaviour is to "lock" or "freeze" the PDF value at the edge of the fitted grid, rather than to truly extrapolate. An alternative extrapolation handler is provided which throws an exception if extrapolation is required. Further extrapolators can be user-defined, but no extra "standard" extrapolators are currently planned.
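The two behaviours described here can be sketched as a pair of functions: one clamps the query point to the grid edge and delegates to the interpolator, the other refuses to extrapolate. The names ToyRange, ToyInterpolator, freezeExtrapolate, and strictExtrapolate are hypothetical, and the interpolator is reduced to a plain callable.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical description of the (x, Q2) range covered by a grid.
struct ToyRange {
  double xMin, xMax, q2Min, q2Max;
};

// Stand-in for an interpolator: any callable returning xf at an in-range point.
using ToyInterpolator = std::function<double(double x, double q2)>;

// "Freeze" behaviour: move the point to the nearest grid edge, then interpolate.
inline double freezeExtrapolate(const ToyRange& r, const ToyInterpolator& interp,
                                double x, double q2) {
  assert(r.xMin <= r.xMax && r.q2Min <= r.q2Max);
  const double xc = std::min(std::max(x, r.xMin), r.xMax);
  const double q2c = std::min(std::max(q2, r.q2Min), r.q2Max);
  return interp(xc, q2c);
}

// Strict behaviour: throw if the point lies outside the grid range.
inline double strictExtrapolate(const ToyRange& r, const ToyInterpolator& interp,
                                double x, double q2) {
  if (x < r.xMin || x > r.xMax || q2 < r.q2Min || q2 > r.q2Max)
    throw std::out_of_range("Point is outside the grid range");
  return interp(x, q2);
}
```

Passing the interpolator into the extrapolation routine mirrors the design choice of letting an extrapolator reuse whichever interpolator is bound to the same PDF.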

Extrapolators, like interpolators, are bound to a GridPDF object. This also makes it possible for an Extrapolator to use the currently bound Interpolator object in its calculation of the extrapolation: this is used by the default nearest-point extrapolator. In principle extrapolation could also allow PDF-specific lookup caching, but this is unlikely to be necessary (and to an extent will occur automatically via caching on the interpolator).

Data format

Each PDF set is defined by a directory (conceivably support could later be added for zip or tar archives whose contents have a directory structure) containing data files. Each PDF member is stored in a text file with the name of the set followed by an underscore and a four digit number (including leading zeros if needed) and the extension .dat, e.g. <setname>/<setname>_0031.dat
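The naming rule can be written down directly. The helper below is a hypothetical illustration (memberDataPath is not a library function) that builds the relative path of a member file from the set name and member number.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Build "<setname>/<setname>_NNNN.dat" with a zero-padded four digit member number.
inline std::string memberDataPath(const std::string& setname, int member) {
  assert(member >= 0 && member <= 9999);  // must fit in four digits
  std::ostringstream os;
  os << setname << "/" << setname << "_"
     << std::setfill('0') << std::setw(4) << member << ".dat";
  return os.str();
}
```

For example, member 31 of a set named CT10 would map to CT10/CT10_0031.dat.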

The head of the file is reserved for member-level metadata in the YAML format: this section ends with a sequence of three dashes (---) on a line of their own. This YAML metadata section is mandatory for all PDF formats, but the format following the --- divider line may vary. The type of PDF (and hence the data block format) must be declared via the YAML "Format" flag in the PDF member file or set info file. (The latter has the path form <setname>/<setname>.info and contains a single YAML document.)

Plain text rather than binary files are mandated for ease of creation, human readability/debuggability, and because standard compression tools are available for later addition to the LHAPDF system if runtime data file size is a serious issue. Separate files are used for data reading efficiency if only a few members of a large set are needed, and all files have names including the set name so that they retain a clear identity when removed from their containing directory, e.g. if exchanged as email attachments.

For grid PDFs, the following file content is in a grid format uniform to all PDF families: subgrids are delimited by more --- line separators, and within each subgrid the first two lines are lists of the x and Q knot values respectively used in that grid.

Note: While Q2 is the representation of the renormalization scale used inside the LHAPDF library (since generator and other codes will typically query the PDF via the squared scale and it would be inefficient to have to call sqrt(q2) every time), in the data files the unsquared Q form is used both for PDF and alpha_s interpolation knot positions. This is to improve the readability of the files, since unsquared values are more easily identified with the quark and Z masses.

The following lines are the xf values of the PDFs for each supported flavour, each line representing one (x,Q) grid anchor point. The order of the xf lines is that of a nested pair of loops, the outer over x knots and the inner over Q knots – hence each subgrid data block is a series of line groups, each with a single x value but different Q values in its constituent lines. The flavours for which the xf values are listed in each line are specified as set (or member) metadata, and the xf values are listed in increasing order of PDG ID code (e.g. usually -5..5, 21 or -6..6, 21 for the standard 5 or 6 quark-flavour PDF). The final line must be another --- delimiter to unambiguously declare the end of the final subgrid block.

Scientific formatting of floating point numbers should be used throughout the data block, and as required in the metadata block. The exponent character should be E (or e) rather than D, and the set preparer is responsible for ensuring that the values are entered with sufficient precision for the required numerical performance.

Lines which begin with a # symbol will be treated as comments and excluded from the format parsing. Partial line comments are not allowed: the # symbol must be the absolute first character on the line otherwise it will be treated as part of a data line.
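The grid block layout described above can be sketched as a small reader. This is an illustration under stated assumptions, not the library's parser: ToySubgridData, parseNumbers, and parseGridBlocks are hypothetical names, the stream is assumed to be positioned just after the header's --- line, blank lines are skipped, and the delimiter line is assumed to contain exactly --- with no trailing whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory form of one subgrid block.
struct ToySubgridData {
  std::vector<double> xs, qs;           // x and (unsquared) Q knot values
  std::vector<std::vector<double>> xf;  // one row of flavour values per (x, Q) point

  // Row index for knot (ix, iq): outer loop over x, inner loop over Q.
  std::size_t row(std::size_t ix, std::size_t iq) const {
    return ix * qs.size() + iq;
  }
};

// Split a whitespace-separated line into floating point values.
inline std::vector<double> parseNumbers(const std::string& line) {
  std::istringstream is(line);
  std::vector<double> vals;
  double v;
  while (is >> v) vals.push_back(v);
  return vals;
}

// Read all subgrid blocks, each terminated by a --- line.
inline std::vector<ToySubgridData> parseGridBlocks(std::istream& in) {
  std::vector<ToySubgridData> grids;
  ToySubgridData current;
  int lineInBlock = 0;
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#') continue;  // blank or full-line comment
    if (line == "---") {
      // Every (x, Q) knot pair must have exactly one data line.
      assert(current.xf.size() == current.xs.size() * current.qs.size());
      grids.push_back(current);
      current = ToySubgridData();
      lineInBlock = 0;
      continue;
    }
    if (lineInBlock == 0) current.xs = parseNumbers(line);
    else if (lineInBlock == 1) current.qs = parseNumbers(line);
    else current.xf.push_back(parseNumbers(line));
    ++lineInBlock;
  }
  return grids;  // a block without a closing --- is deliberately not returned
}
```

The row helper encodes the nested-loop ordering: with three Q knots, the data line for the second x knot and first Q knot is the fourth line of the block.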

New PDF formats may be proposed to the LHAPDF maintainers, and will be considered for standardisation. Acceptance of proposals is not guaranteed, and modifications may be insisted upon. We do appreciate the effort, but need to ensure that new standard formats are kept to a maintainably small number, and that such formats are sufficiently clean and general that they can conceivably be maintained forever.

Metadata: the Info system

Info objects for metadata and configuration handling: cascading from global settings down to member-level settings for flexibility. Able to be read in YAML format from any file, stopping parsing at the --- marker: this allows metadata to be read for many (including all) sets simultaneously without incurring the memory penalty of having loaded many data grids. Allows automatic documentation for web, PDF (via LaTeX), etc.

Todo: Documentation system for PDFs – output for pick-up by Doxygen?

The global configuration is specified via the $prefix/lhapdf.conf file. Settings can be overridden in the code via a singleton Config object.

PDF set-level info can be supplied as an override for the defaults via the PDFSet objects. These inherit from the Info base class, as do Config and the member-level PDFInfo type. PDFSet objects are also singletons: at most one can exist for a given set name. This means that after all members of a set have been loaded as PDF objects, their behaviours specified at set-level can all be changed via the shared PDFSet object. If a member's PDFInfo object specifies a metadata flag, that value is the one that will be used, even if a set-level version of the same flag is explicitly reset.
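The cascading lookup (member level first, then set level, then global configuration) can be sketched with a parent pointer. ToyInfo is a hypothetical stand-in for the Info hierarchy, values are kept as strings for simplicity, and the key names used in any example are illustrative rather than taken from the real flag list.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical metadata node: member -> set -> global config via parent links.
struct ToyInfo {
  std::map<std::string, std::string> entries;
  const ToyInfo* parent = nullptr;  // next, more general, level (may be null)

  // Return the most specific value defined for key, falling back up the chain.
  const std::string& get(const std::string& key) const {
    assert(parent != this);  // guard against a trivial self-cycle
    const auto it = entries.find(key);
    if (it != entries.end()) return it->second;
    if (parent) return parent->get(key);
    throw std::out_of_range("Metadata key not found: " + key);
  }
};
```

Because members hold a pointer to the shared set-level node rather than a copy, changing a set-level entry is immediately visible to every member that does not define the same key itself.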

The .info file allows for PDF sets to be versioned (to permit trackable updates of a grid to fix bugs or improve the interpolation knot positions) using the integer DataVersion metadata flag. A negative value of this flag indicates that this PDF set is not suitable for production use, and the library will print a warning message to the terminal in this case.

QCD alpha_s evolution

alpha_s may be calculated by several methods: analytic approximations, numerical solution of the evolution ODE (both with use of flavour threshold treatments), or by interpolation of tabulated datapoints in the set/member metadata.

The interpolation approach uses Q and alpha_s knot values stored as "pure metadata" in the header for uniformity of treatment between different (hypothetical) PDF grid formats. Interpolation subgrids across flavour thresholds are supported, in this case with the subgrid boundaries declared by repeated consecutive values in the Q knots array. Cubic interpolation in log(Q2) is used, with fixed value extrapolation at high-Q. The ODE calculator solves the ODE only once, then saves the result as an interpolation grid for (much) improved efficiency.
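The convention of marking subgrid boundaries by repeated consecutive Q knot values can be sketched as a splitting routine. The function name splitAtRepeatedKnots is hypothetical, and the sketch assumes the knots are otherwise strictly increasing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split a Q knot array into subgrids wherever a value is repeated consecutively.
// The repeated value ends one subgrid and starts the next.
inline std::vector<std::vector<double>> splitAtRepeatedKnots(
    const std::vector<double>& qs) {
  std::vector<std::vector<double>> subgrids;
  std::vector<double> current;
  for (std::size_t i = 0; i < qs.size(); ++i) {
    if (!current.empty() && qs[i] == current.back()) {
      subgrids.push_back(current);  // close the subgrid at the threshold
      current.clear();
    }
    // Within a subgrid the knots must be strictly increasing.
    assert(current.empty() || qs[i] > current.back());
    current.push_back(qs[i]);
  }
  if (!current.empty()) subgrids.push_back(current);
  return subgrids;
}
```

The matching alpha_s values would be split at the same positions, so that each subgrid carries its own value at the threshold and a discontinuity can be represented.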

QCD parameters related to alpha_s are specified in the metadata: AlphaS_Lambda4, AlphaS_Lambda5, etc., AlphaS_MZ, NumFlavors, FlavorScheme, quark masses, QCD order (as an int = number of loops), and the alpha_s solver name. See the CONFIGFLAGS document for the full list.

Note: The AlphaS objects are all designed to have no dependence on other LHAPDF objects: they can be happily used in non-PDF contexts. The default parameters are overridden by the PDF objects when creating

Laziness

Interpolators, extrapolators, and AlphaS objects are only loaded when they (or a value calculated by them) are requested: this means that PDF objects loaded only for metadata reasons do not waste time or memory on creating calculators which will not be used. It also means that user-supplied versions of these objects may be passed without having to first create and delete a default as specified by the PDF's metadata fields.
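A minimal sketch of this lazy pattern follows. ToyCalculator and ToyLazyPDF are hypothetical stand-ins for a helper object and its owning PDF, and std::unique_ptr is used here purely as a convenient owning pointer for the example.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical helper object (standing in for an interpolator or alpha_s solver).
struct ToyCalculator {
  int tag;  // 0 marks the default that the metadata would specify
};

class ToyLazyPDF {
public:
  // Build the default calculator only when it is first requested.
  ToyCalculator& calculator() {
    if (!_calc) {
      _calc.reset(new ToyCalculator{0});
      ++_defaultBuilds;
    }
    return *_calc;
  }

  // Install a user-supplied calculator; the PDF takes ownership of it.
  void setCalculator(std::unique_ptr<ToyCalculator> c) {
    assert(c != nullptr);
    _calc = std::move(c);
  }

  int defaultBuilds() const { return _defaultBuilds; }

private:
  std::unique_ptr<ToyCalculator> _calc;  // empty until needed
  int _defaultBuilds = 0;                // for illustration only
};
```

An object used only for metadata never builds a calculator, and one given a user-supplied calculator before first use never builds the default.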

Other laziness may be added in time, e.g. lazy loading of PDF data blocks.

Factories

PDF members, interpolators, extrapolators, and alpha_s solvers are all obtainable by name from factories: this allows all configuration of defaults via data/info files rather than needing set-specific code. The factories are based on hard-coded names: we do not anticipate a need for truly dynamic "plug-in" specification of these objects, and such a facility adds significant complication to the framework. For development of new interpolators, or use of personalised ones, all helper object binding designed to be handled by factories may also be explicitly overridden in user code via the polymorphic Interpolator, Extrapolator, and AlphaS interfaces.

Factory instantiation of PDFs will be the normal approach, since return of a reference to the PDF base type obviates the need to know whether a set is based on grid interpolation or something else. This can be determined via set/member metadata obtained without loading the format-specific data.
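A factory keyed on hard-coded names can be as simple as the following sketch. The class names and the name strings "linear" and "logcubic" are hypothetical and chosen only for illustration. The caller receives a pointer to the base type and need not know the concrete class.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical polymorphic interface for the objects made by the factory.
struct ToyInterpolatorBase {
  virtual ~ToyInterpolatorBase() = default;
  virtual std::string name() const = 0;
};

struct ToyLinear : ToyInterpolatorBase {
  std::string name() const override { return "linear"; }
};

struct ToyLogCubic : ToyInterpolatorBase {
  std::string name() const override { return "logcubic"; }
};

// Factory with hard-coded names: no dynamic plug-in mechanism is involved.
inline std::unique_ptr<ToyInterpolatorBase> makeToyInterpolator(
    const std::string& name) {
  if (name == "linear")
    return std::unique_ptr<ToyInterpolatorBase>(new ToyLinear());
  if (name == "logcubic")
    return std::unique_ptr<ToyInterpolatorBase>(new ToyLogCubic());
  throw std::invalid_argument("Unknown interpolator name: " + name);
}
```

Since the name is an ordinary string, it can come straight from a metadata entry, which is what lets defaults be configured in data files rather than in set-specific code.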

Memory management and ownership semantics

PDF objects own the interpolators, extrapolators, alpha_s calculators and info objects associated to them. When the PDF object goes out of scope, these will be deleted. The user should not attempt to delete these objects once they have been associated to a PDF, and user implementations of these objects should not attempt to delete the associated PDF.

The freeing of memory associated with objects created in or passed to PDF objects is automatic via std::auto_ptr smart pointers. The LHAGLUE interface uses similar smart pointers to avoid memory leaks.

Backward compatibility

LHAPDF5 and "LHAGLUE" (i.e. PDFLIB interface) Fortran API elements are provided for Fortran generators which won't/can't change their calling code. LHAPDF5 ID codes will be supported and will continue to be assigned: PDF loading by these IDs will be possible and used as the back-end for the Fortran functions.

The LHAPDF5 C++ and Python APIs are not supported in LHAPDF6, as codes which use these will generally be able to update. In C++, parallel support for LHAPDF5 and LHAPDF6 (and potentially future versions) in calling code may be achieved by use of the LHAPDF_MAJOR_VERSION preprocessor macro, which is defined with the integer value 6 in LHAPDF6, and not at all in LHAPDF5. Python compatibility can be achieved more dynamically, e.g. by use of the hasattr() built-in function to test the capabilities of objects and modules.
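The preprocessor approach might look like the sketch below. In real calling code the LHAPDF headers would be included before the check. They are omitted here so that the fragment compiles standalone, in which case the macro is undefined and the legacy branch is selected. The function name lhapdfApiGeneration is hypothetical.

```cpp
#include <cassert>
#include <string>

// The macro is defined (with value 6) by LHAPDF6 and not at all by LHAPDF5,
// so test for its existence before comparing its value.
#if defined(LHAPDF_MAJOR_VERSION) && LHAPDF_MAJOR_VERSION >= 6
inline std::string lhapdfApiGeneration() { return "6 or later"; }
#else
inline std::string lhapdfApiGeneration() { return "legacy"; }
#endif
```

In practice each branch would contain the version-specific calls rather than a string, so that only the code matching the installed library is compiled.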

Migration and regression testing

Continuous validation against Fortran LHAPDF5 will be made using automated scripts. A nominal per-mille (0.1%) maximum deviation will be tolerated to consider a PDF as "validated".

The deviation function may include a tolerance treatment so that larger fractional deviations can be tolerated in regions where the PDF's xf value is anyway so small that it is of no physical importance: an absolute tolerance value of O(10^-5) is suggested for these regions. [The current deviation function is defined as d = ((xf6 + epsilon) / (xf5 + epsilon)) - 1 with abs tolerance epsilon, so that d -> 0 as xf << epsilon.]
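The deviation function quoted above translates directly into code. The function names and default arguments are illustrative: the defaults follow the suggested absolute tolerance of 1e-5 and the nominal 0.1% limit, and the sketch assumes xf5 + epsilon is non-zero.

```cpp
#include <cassert>
#include <cmath>

// d = ((xf6 + epsilon) / (xf5 + epsilon)) - 1, as defined in the text.
inline double deviation(double xf6, double xf5, double epsilon = 1e-5) {
  assert(epsilon > 0.0);
  return (xf6 + epsilon) / (xf5 + epsilon) - 1.0;
}

// A point passes if the regularised deviation is within the tolerance.
inline bool isValidated(double xf6, double xf5, double tolerance = 1e-3,
                        double epsilon = 1e-5) {
  return std::fabs(deviation(xf6, xf5, epsilon)) <= tolerance;
}
```

The effect of epsilon is that two values of order 1e-9 differing by a factor of two still pass, since both are negligible compared with the absolute tolerance, while a 1% difference between values of order one does not.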

The LHAPDF ID code and index system

Index file pdfsets.index. Just two columns: a positive integer "LHAPDF ID" and the name of the PDF set whose first member starts at that ID. Used for lookup of PDF set names by LHAPDF ID, to avoid need for code modification to use this system. To avoid having to list every member's ID code, an arbitrary ID is associated with the set whose listed first ID is the largest one less than or equal to it; the member number is then the difference between the two. The LHAPDF ID is the natural successor of the PDFLIB numbering scheme and will remain backward compatible with it, i.e. CTEQ6L1 will remain with ID 10042, CT10 with ID 10800, etc.
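The ID lookup amounts to finding the last index entry whose first ID does not exceed the requested one. The sketch below uses a sorted map for this. ToyIndex and lookupPDF are hypothetical names, and no check is made that the resulting member number actually exists in the set.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical in-memory index: first LHAPDF ID of each set -> set name.
using ToyIndex = std::map<int, std::string>;

// Return (set name, member number) for an arbitrary LHAPDF ID.
inline std::pair<std::string, int> lookupPDF(const ToyIndex& index, int lhaid) {
  // upper_bound gives the first entry whose first ID is greater than lhaid,
  // so the entry just before it is the one that contains lhaid.
  auto it = index.upper_bound(lhaid);
  if (it == index.begin())
    throw std::out_of_range("LHAPDF ID is below all listed sets");
  --it;
  return {it->second, lhaid - it->first};
}
```

With the two entries named in the text, ID 10805 would resolve to member 5 of CT10 and ID 10042 to member 0 of CTEQ6L1.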

Some design details

Use of Q2 rather than Q internally and in the API: fits with generator implementation where evolution variables are usually squared, hence avoiding a call to sqrt, and because squaring is a cheaper operation than sqrt.

The methods accepting a Q or Q2 argument explicitly declare which is being used in their names, e.g. xfxQ and xfxQ2, as this could not be otherwise inferred from the identical method signatures.

Search paths: search for set directories, and info and data files in them, based on a fallback path treatment. The default path list should just be the <install_prefix>/share/LHAPDF directory. If defined, the colon-separated LHAPDF_SET_PATH variable will be searched first.
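One possible reading of this fallback treatment is sketched below, with the variable's entries placed ahead of the install-prefix default. Whether the variable should prepend or replace the default is still open, so this is only one of the options. The function buildSearchPaths is hypothetical, and the variable's value is passed in as a string (in real code it might come from std::getenv) so that the sketch has no dependence on the environment.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Build the ordered search path list: entries from the colon-separated
// variable first, then the default <install_prefix>/share/LHAPDF directory.
inline std::vector<std::string> buildSearchPaths(const std::string& envValue,
                                                 const std::string& installPrefix) {
  std::vector<std::string> paths;
  std::istringstream is(envValue);
  std::string item;
  while (std::getline(is, item, ':')) {
    if (!item.empty()) paths.push_back(item);  // skip empty entries such as "::"
  }
  paths.push_back(installPrefix + "/share/LHAPDF");
  return paths;
}
```

A file lookup would then try each directory in order and use the first match.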

Todo: Should the env var overwrite the install prefix or prepend?

Prepending/appending or explicit setting of the paths should be possible via the API.

Todo: Use the global config system for the path handling?

Flavor: the word is spelt the American way in the code for definiteness and consistency! It's just a convention, don't take it personally ;-)

Fix of the historic CTEQ6L1 "CTEQ6ll" naming typo, with backward compatibility in LHAGLUE. A replica of the Fortran PDFLIB and LHAPDFv5 function/subroutine interfaces is provided under the name LHAGLUE. This should behave familiarly for users of the old code, although the newly native C++ interface is much more pleasant and powerful. Multi-set methods are provided in the LHAGLUE interface, with the previous restriction on the number of simultaneously loaded sets removed. The dynamic memory allocation and deallocation is automatic.

Future prospects

Possible future additions include new interpolators (e.g. interpolation in log space or separate x,Q2 interpolator functions) and DGLAP or other numerically evolved PDF types based on e.g. HOPPET. The latter are not included by default as there has been a strong trend of PDFs toward purely interpolated grids during the lifetime of LHAPDF5.

"Indirect representations" of PDFs are anticipated but not directly supported in LHAPDF 6.0: these are grid PDFs in which the grid interpolation is not directly performed on the functions corresponding to physical partons, but on "utility" functions such as separated valence and sea components, which are combined to give the physical results. This may be implemented by making use of the "generator specific" range of PDG ID codes to represent the intermediate "flavors" – several issues are involved in such an extension, so please contact the authors if you wish to implement such a PDF type so that we can ensure that the design meets general scalability and maintainability requirements.

Nuclear PDFs are not supported in LHAPDFv6.0. We hope that the new C++ interface will make the writing of nuclear PDF wrapper classes easy, but do not personally have the experience to do so optimally. Nuclear physicists interested in addition of nuclear PDF capabilities are encouraged to get in touch with the LHAPDF developers to discuss requirements, and concrete proposals are particularly encouraged.