PerlGuts Illustrated
Version 0.01

This document is meant to supplement the perlguts(1) manual page that comes with Perl. It includes commented illustrations of all (eventually) major internal Perl data structures. Having this document handy hopefully makes reading the Perl source code easier. I'll try to expand it as I learn more.

The first things to look at are the data structures that represent Perl data: scalars of various kinds, arrays and hashes. Internally Perl calls a scalar SV (scalar value), an array AV (array value) and a hash HV (hash value). In addition it uses IV for integer value, NV for numeric value (aka double), PV for pointer value (aka string value (char*), but 'S' was already taken), and RV for reference value. The IVs are further guaranteed to be big enough to hold a void*.

The internal relationship between the Perl data types is really object oriented. Perl relies on C's structural equivalence to emulate something like C++ inheritance of types. The various data types that Perl implements are illustrated in this class hierarchy diagram. The arrows indicate inheritance.

As you can see, Perl uses multiple inheritance with SvNULL acting as some kind of virtual base class. All the Perl types are identified by small numbers, and the internal Perl code often gets away with testing the ISA-relationship between types with the <= operator. As you can see from the figure above, this can only work reliably for some comparisons. All Perl data value objects are tagged with their type, so you can always ask an object what its type is and act according to this information.

The symbolic type names (and associated value) are:

0) SVt_NULL
1) SVt_IV
2) SVt_NV
3) SVt_RV
4) SVt_PV
5) SVt_PVIV
6) SVt_PVNV
7) SVt_PVMG
8) SVt_PVBM
9) SVt_PVLV
10) SVt_PVAV
11) SVt_PVHV
12) SVt_PVCV
13) SVt_PVGV
14) SVt_PVFM
15) SVt_PVIO
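
The numbering above can be written down as a C enum. The sketch below is only an illustration: the constants mirror the list, while toy_has_magic() and toy_has_nvx() are hypothetical helpers (not functions from the Perl source) that show where an ordering comparison is enough and where it is not.

```c
/* Type codes, numbered 0..15 exactly as in the list above. */
enum toy_svtype {
    SVt_NULL, SVt_IV,   SVt_NV,   SVt_RV,
    SVt_PV,   SVt_PVIV, SVt_PVNV, SVt_PVMG,
    SVt_PVBM, SVt_PVLV, SVt_PVAV, SVt_PVHV,
    SVt_PVCV, SVt_PVGV, SVt_PVFM, SVt_PVIO
};

/* Hypothetical helper: SvPVMG and everything numbered after it carry
 * the MAGIC and STASH fields, so a plain ordering test works here. */
static int toy_has_magic(enum toy_svtype t)
{
    return t >= SVt_PVMG;
}

/* Hypothetical helper: a naive "t >= SVt_NV" would wrongly accept
 * SVt_RV, SVt_PV and SVt_PVIV, none of which has an NVX field, so
 * this comparison has to be spelled out in two parts. */
static int toy_has_nvx(enum toy_svtype t)
{
    return t == SVt_NV || t >= SVt_PVNV;
}
```

For example, toy_has_magic(SVt_PVHV) is true and toy_has_nvx(SVt_PVIV) is false.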

In addition to the simple type names already mentioned, the following names are found in the figure: An SvPVIV value can hold a string and an integer value. An SvPVNV value can hold a string, an integer and a double value. The SvPVMG is used when magic is attached or the value is blessed. The SvPVBM adds information for fast searching (Boyer-Moore) on the string value. The SvPVLV represents an l-value object (the result of substr). CV is a code value, which represents a Perl function/subroutine/closure or contains a pointer to an XSUB. GV is a glob value, and IO contains pointers to open files and directories and other state information about files. The SvPVFM is used to hold information on forms.

Perl data objects can change type as their value changes. The SV is said to be upgraded in this case. Type changes only go down the hierarchy. (See the sv_upgrade() function in sv.c.)

The actual layout in memory does not really match how a typical C++ compiler would implement a hierarchy like the one depicted above. Let's see how it is done.

In the description below we use field names that match the macros that are used to access the corresponding field. For instance, the xpv_cur field of the xpvXX structs is accessed with the SvCUR() macro. The field is referred to as CUR in the description below.

SvNULL and struct sv

The simplest type is SvNULL. It always represents an undefined scalar value. It consists of the "struct sv" only, and looks like this:

It contains a pointer (ANY) to more data, which in this case is always NULL. All the subtypes are implemented by attaching additional data to the ANY pointer.

The second field is an integer reference counter (REFCNT) which should tell us how many pointers reference this object. When a Perl data object is created this value is initialized to 1. The field must be incremented when a new pointer is made to point to it and decremented when the pointer is destroyed or assigned a different value. When the reference count reaches zero the object is freed.
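
The counting rule can be sketched in a few lines of C. This is a toy model, not the code from sv.c: rc_sv, rc_new(), rc_inc(), rc_dec() and rc_freed are all hypothetical names, and the struct models only the ANY and REFCNT fields.

```c
#include <stdlib.h>

/* Toy stand-in for struct sv; only ANY and REFCNT are modelled. */
typedef struct rc_sv {
    void        *any;     /* ANY: always NULL in this sketch */
    unsigned int refcnt;  /* REFCNT */
} rc_sv;

/* Counts freed objects so the behaviour can be observed. */
static int rc_freed = 0;

/* A new object starts with a reference count of 1. */
static rc_sv *rc_new(void)
{
    rc_sv *sv = malloc(sizeof *sv);
    if (sv) {
        sv->any = NULL;
        sv->refcnt = 1;
    }
    return sv;
}

/* Call when a new pointer is made to point at the object. */
static rc_sv *rc_inc(rc_sv *sv)
{
    if (sv)
        sv->refcnt++;
    return sv;
}

/* Call when a pointer goes away; frees the object at zero. */
static void rc_dec(rc_sv *sv)
{
    if (sv && --sv->refcnt == 0) {
        free(sv);
        rc_freed++;
    }
}
```

Each rc_inc() must eventually be paired with one rc_dec(); the object is released by whichever rc_dec() brings the count to zero.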

The third field contains some FLAGS and a TYPE sub-field.

The type field contains a code that represents one of the types shown in the type hierarchy figure above.

The SV contains 24 flag bits. The flag bits are used to denote how the fields of the type value objects should be interpreted, along with various other state of the objects. Some flags are just used as optimizations in order to avoid having to dereference several levels of pointers just to find that the information is not there.

The purposes of the flag bits are:

0) PADBUSY
reserved for tmp or my already

1) PADTMP
in use as tmp

2) PADMY
in use as a "my" variable

3) TEMP
string is stealable

4) OBJECT
This flag is set when the object is "blessed". It can only be set for value type SvPVMG or subtypes of it. This flag also indicates that the STASH pointer is valid and points to a namespace HV.

5) GMAGICAL (Get Magic)
This flag indicates that the object has a magic get or len method to be invoked. It can only be set for value type SvPVMG or subtypes of it. This flag also indicates that the MAGIC pointer is valid.

6) SMAGICAL (Set Magic)
This flag indicates that the object has a magic set method to be invoked.

7) RMAGICAL (Random Magic)
This flag indicates that the object has any other magical methods (besides get/len/set magic methods) or even methodless magic attached.

An object with any of GMAGICAL, SMAGICAL and RMAGICAL set is called MAGICAL.

8) IOK (Integer OK)
This flag indicates that the object has a valid IVX field value. It can only be set for value type SvIV or subtypes of it.

9) NOK (Numeric OK)
This flag indicates that the object has a valid NVX field value. It can only be set for value type SvNV or subtypes of it.

10) POK (Pointer OK)
This flag indicates that the object has valid PVX, CUR and LEN field values (i.e. a valid string value). It can only be set for value type SvPV or subtypes of it.

11) ROK (Reference OK)
This flag indicates that the type is SvRV and that the RV field contains a valid reference pointer. An SvRV object with the ROK flag off represents an undefined value.

12) FAKE
glob or lexical is just a copy

13) OOK (Offset OK)
This flag indicates that the IVX value is to be interpreted as a string offset. This flag can only be set for value type SvPVIV or subtypes of it. It also follows that the IOK (and IOKp) flags must be off when OOK is on. Take a look at the SvOOK figure below.

14) BREAK
refcnt is artificially low

15) READONLY
This flag indicates that the value of the object may not be modified.

16) IOKp (Integer OK Private)
has valid non-public integer value

17) NOKp (Numeric OK Private)
has valid non-public numeric value

18) POKp (Pointer OK Private)
has valid non-public pointer value

19) SCREAM
has been studied

20) AMAGIC
has magical overloaded methods

21) SHAREKEYS
22) LAZYDEL
22) TAIL
23) VALID
23) COMPILED
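
To make the FLAGS/TYPE split concrete, here is a minimal sketch of how one 32-bit word can hold both. It assumes that TYPE occupies the low 8 bits with the 24 flag bits above it; that placement, and every FL_/fl_ name below, is an assumption made for illustration and should be checked against sv.h.

```c
#include <stdint.h>

/* Assumption: TYPE is the low 8 bits, the 24 flags sit above it. */
#define FL_TYPE_MASK  0xffu
#define FL_BIT(n)     ((uint32_t)1 << (8 + (n)))  /* n = number in the list */

#define FL_OBJECT     FL_BIT(4)
#define FL_GMAGICAL   FL_BIT(5)
#define FL_SMAGICAL   FL_BIT(6)
#define FL_RMAGICAL   FL_BIT(7)
#define FL_MAGICAL    (FL_GMAGICAL | FL_SMAGICAL | FL_RMAGICAL)
#define FL_IOK        FL_BIT(8)
#define FL_NOK        FL_BIT(9)
#define FL_POK        FL_BIT(10)
#define FL_ROK        FL_BIT(11)

/* Toy stand-in for struct sv with its three fields. */
typedef struct fl_sv {
    void    *any;     /* ANY */
    uint32_t refcnt;  /* REFCNT */
    uint32_t flags;   /* FLAGS and TYPE packed together */
} fl_sv;

/* Extract the type code. */
static unsigned fl_type(const fl_sv *sv)
{
    return sv->flags & FL_TYPE_MASK;
}

/* True if any of the three magic flags is set. */
static int fl_is_magical(const fl_sv *sv)
{
    return (sv->flags & FL_MAGICAL) != 0;
}

/* True if the string value is valid. */
static int fl_pok(const fl_sv *sv)
{
    return (sv->flags & FL_POK) != 0;
}
```

Testing a flag this way never touches the ANY pointer, which is the optimization the text describes.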

The struct sv is common for all subtypes of SvNULL in Perl. In the Perl source code this structure is typedefed to SV, AV, HV and others. Routines that can take any type as parameter will have SV* as parameter. Routines that only work with arrays or hashes have AV* or HV* respectively in their parameter list.

SvPV

A scalar that can hold a string value is called an SvPV. In addition to the SV struct of SvNULL, an xpv struct is allocated and it contains 3 fields. PVX is the pointer to an allocated char array. CUR is an integer giving the current length of the string. LEN is an integer giving the length of the allocated string. The char/byte at (PVX + CUR) should always be '\0' in order to make sure that the string is NUL-terminated if passed to C library routines. This requires that LEN is always at least 1 larger than CUR.
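
The PVX/CUR/LEN invariant is easiest to see in an append operation. The following is a sketch under stated assumptions, not Perl's own string code: toy_pv and toy_pv_cat() are hypothetical names, and the buffer grows by exactly the amount needed, which a real implementation would probably not do.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of the three xpv fields. */
typedef struct toy_pv {
    char  *pvx;  /* PVX: allocated char array (may be NULL) */
    size_t cur;  /* CUR: bytes in use, not counting the NUL */
    size_t len;  /* LEN: bytes allocated */
} toy_pv;

/* Append n bytes from src, keeping LEN >= CUR + 1 and a NUL
 * at PVX + CUR. Returns 0 on success, -1 if allocation fails. */
static int toy_pv_cat(toy_pv *pv, const char *src, size_t n)
{
    size_t need = pv->cur + n + 1;      /* +1 for the trailing NUL */

    if (need > pv->len) {
        char *p = realloc(pv->pvx, need);
        if (!p)
            return -1;
        pv->pvx = p;
        pv->len = need;
    }
    memcpy(pv->pvx + pv->cur, src, n);
    pv->cur += n;
    pv->pvx[pv->cur] = '\0';
    return 0;
}
```

Starting from an empty toy_pv (all fields zero), appending "foo" and then "bar" leaves CUR at 6 and a buffer that C library routines can read as "foobar".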

The POK flag indicates if the string pointed to by PVX contains a valid value. An SvPV with the POK flag turned off represents undef. The PVX pointer can also be NULL when POK is off.

SvPVIV and SvPVNV

The SvPVIV type is like SvPV but has an additional field to hold a single integer value called IVX. The IOK flag indicates if the IVX value is valid. If both the IOK and POK flags are on, then the PVX will (usually) be a string representation of the same number found in IVX.

The SvPVNV type is like SvPVIV but has an additional field to hold a single double value called NVX. The corresponding flag is called NOK.

SvOOK

As a special hack to improve the speed of removing characters from the beginning of a string, the OOK flag is used. When this flag is on, the IVX value is not interpreted as an integer value, but is instead used as an offset into the string. PVX, CUR and LEN are adjusted to point within the allocated string instead. The sv_chop()/sv_backoff() routines adjust the offset.
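
The offset trick can be sketched as follows. This is a toy model and not the code of sv_chop()/sv_backoff(): ook_sv, ook_chop() and ook_backoff() are hypothetical names, and the OOK flag is represented by a plain int member instead of a bit in FLAGS.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of an SvPVIV body plus a stand-in for the OOK flag. */
typedef struct ook_sv {
    char  *pvx;  /* PVX */
    size_t cur;  /* CUR */
    size_t len;  /* LEN */
    long   ivx;  /* IVX: an integer, or the offset when ook is set */
    int    ook;  /* stands in for the OOK flag bit */
} ook_sv;

/* Drop n leading bytes without copying anything. */
static void ook_chop(ook_sv *sv, size_t n)
{
    if (n > sv->cur)
        n = sv->cur;
    if (!sv->ook) {          /* IVX stops being an integer value */
        sv->ivx = 0;
        sv->ook = 1;
    }
    sv->pvx += n;
    sv->cur -= n;
    sv->len -= n;
    sv->ivx += (long)n;      /* remember how far PVX has moved */
}

/* Undo the offset: move the string back to the start of the
 * allocation and clear the flag. */
static void ook_backoff(ook_sv *sv)
{
    if (sv->ook) {
        char *start = sv->pvx - sv->ivx;
        memmove(start, sv->pvx, sv->cur + 1);  /* include the NUL */
        sv->pvx = start;
        sv->len += (size_t)sv->ivx;
        sv->ivx = 0;
        sv->ook = 0;
    }
}
```

The start of the allocation is always recoverable as PVX minus the offset, which is what makes it safe to free or reallocate the buffer later.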

SvIV and SvNV

As a special case we also have SvIV and SvNV types that only have room for a single integer or a single double value. These are special in that the PVX/CUR/LEN fields are not present, even though the ANY pointer actually points to where a ghost incarnation of them would start. This arrangement makes it possible for code to always access the IVX/NVX fields at a fixed offset from where the SV field ANY points.

SvRV

The SvRV subtype just lets the SV field ANY point to a pointer which points to an SV (which can be any of the subtypes above and below).

SvPVMG

The SvPVMG is like SvPVNV above, but has two additional fields: MAGIC and STASH. MAGIC is a pointer to additional structures that contain callback functions and other data. If the MAGIC pointer is non-NULL, then one or more of the MAGICAL flags will be set.

STASH is a pointer to an HV that represents some namespace/class. This field is set when the value is blessed into a package (becomes an object). The OBJECT flag will be set when STASH is.

The MAGIC structure in detail....

AV

An array is in many ways represented similarly to a string. An AV contains all the fields of SvPVMG and adds the following three fields: ALLOC is a pointer to the allocated array. ARYLEN is a pointer to an SV (which is returned when $#array is requested). FLAGS contains some extra flag bits that are specific to the array subtype.

The first three fields of xpvav have been renamed even though they serve nearly the same function. PVX has become ARRAY, CUR has become FILL and LEN has become MAX. One difference is that the value of FILL/MAX is always one less than CUR/LEN would be in the same situation. The IVX/NVX fields are unused.

The array pointed to by ARRAY contains pointers to any of the SvNULL subtypes. Usually ALLOC and ARRAY both point to the start of the allocated array. The use of two pointers is similar to the OOK hack described above. The shift operation can be implemented efficiently by just adjusting the ARRAY pointer (and FILL/MAX). Similarly, pop just involves decrementing the FILL count.
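
A small sketch shows why shift and pop are cheap with this layout. It is a toy model, not av.c: toy_av, toy_av_shift() and toy_av_pop() are hypothetical names, elements are plain void pointers, and reference counting is left out.

```c
#include <stddef.h>

/* Toy model of the array-specific fields. */
typedef struct toy_av {
    void **alloc;  /* ALLOC: start of the allocated array */
    void **array;  /* ARRAY: first live element */
    long   fill;   /* FILL: index of the last element, -1 if empty */
    long   max;    /* MAX: index of the last usable slot from ARRAY */
} toy_av;

/* Remove and return the first element by moving ARRAY forward. */
static void *toy_av_shift(toy_av *av)
{
    void *first;

    if (av->fill < 0)
        return NULL;
    first = av->array[0];
    av->array++;          /* ALLOC still remembers the real start */
    av->fill--;
    av->max--;
    return first;
}

/* Remove and return the last element by decrementing FILL. */
static void *toy_av_pop(toy_av *av)
{
    if (av->fill < 0)
        return NULL;
    return av->array[av->fill--];
}
```

Neither operation moves any element. After a shift, ARRAY and ALLOC differ by one slot, which is the array counterpart of the string offset kept in IVX under OOK.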

There are only 3 array flags used: (I'll try to describe them when I understand the issue)

0) REAL
1) REIFY
2) REUSED

HV

Hashes are the most complex of the Perl data types. In addition to what we have seen above, HVs use HE structs to represent key/value pairs and HEK structs to represent keys.

The HV type itself contains all the fields of SvPVMG and then adds four new fields:

As for AVs, the first few fields of the xpvhv have been renamed in the same way. MAX is the number of elements in ARRAY minus one. (The size of the ARRAY is required to be a power of 2, since the code just uses ARRAY[HASH & MAX] to locate the correct HE column for a key.) Also note that ARRAY can be NULL (but MAX will never be below 7). FILL is the number of elements in ARRAY that are not NULL. The IVX field has been renamed KEYS and is the number of hash elements in the hash. The NVX field is unused.

In a perfect hash both KEYS and FILL are the same value. This means that all HEs can be located directly in the ARRAY (and all the he->next pointers are NULL).
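
The bucket lookup and the KEYS/FILL bookkeeping can be sketched like this. It is a simplified model, not hv.c: toy_he, toy_hv, toy_hv_store() and toy_hv_fetch() are hypothetical names, the key and its hash are stored directly in the entry instead of in a separate HEK, keys are NUL-terminated C strings, and the caller supplies the hash value.

```c
#include <stddef.h>
#include <string.h>

/* Simplified hash entry (a real HE refers to a HEK for the key). */
typedef struct toy_he {
    struct toy_he *next;  /* next entry in the same column */
    unsigned long  hash;  /* hash of the key, computed by the caller */
    const char    *key;
    void          *val;
} toy_he;

/* Toy model of the hash-specific fields. */
typedef struct toy_hv {
    toy_he      **array;  /* ARRAY: MAX + 1 column heads, may be NULL */
    unsigned long max;    /* MAX: number of columns minus one */
    unsigned long fill;   /* FILL: columns that are not NULL */
    unsigned long keys;   /* KEYS: total number of entries */
} toy_hv;

/* Link an entry into its column; ARRAY must already be allocated. */
static void toy_hv_store(toy_hv *hv, toy_he *he)
{
    toy_he **column = &hv->array[he->hash & hv->max];

    if (*column == NULL)
        hv->fill++;       /* first entry in this column */
    he->next = *column;
    *column = he;
    hv->keys++;
}

/* Find the entry for key, or NULL if it is not present. */
static toy_he *toy_hv_fetch(const toy_hv *hv, const char *key,
                            unsigned long hash)
{
    toy_he *he;

    if (hv->array == NULL)
        return NULL;
    for (he = hv->array[hash & hv->max]; he != NULL; he = he->next)
        if (he->hash == hash && strcmp(he->key, key) == 0)
            return he;
    return NULL;
}
```

Because MAX + 1 is a power of 2, hash & max gives the same column as hash modulo the column count. When two entries land in the same column, KEYS grows but FILL does not, so KEYS equal to FILL means no chains.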

The following two hash specific flags are found among the common SvNULL flags:

21) SHAREKEYS
keys live on shared string table

22) LAZYDEL
entry in xhv_eiter must be deleted

More stuff .....




© 1998 Gisle Aas
<aas@sn.no>