Reading Python Code - json/__init__.py

urangurang

2019-08-24

python

JSON, python

json - JSON encoder and decoder

Source: https://docs.python.org/3/library/json.html
Code: https://github.com/python/cpython/blob/3.7/Lib/json/__init__.py

The first installment of the Reading Python Code series is json.

import json

Let's look together at how json, a module we run into all the time during development, works on the inside.

I read the code in the cpython repository. The standard library modules written in Python live under the ./Lib/ directory.

These are the contents of the cpython/Lib/json/ directory:

__init__.py
decoder.py
encoder.py
scanner.py
tool.py

In this post we will look at __init__.py.

`__init__.py`

I will go through it step by step, focusing on the functions and the main variables.

def dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True,
         allow_nan=True, cls=None, indent=None, separators=None,
         default=None, sort_keys=False, **kw):
    if (not skipkeys and ensure_ascii and
        check_circular and allow_nan and
        cls is None and indent is None and separators is None and
        default is None and not sort_keys and not kw):
        iterable = _default_encoder.iterencode(obj)
    else:
        if cls is None:
            cls = JSONEncoder
        iterable = cls(skipkeys=skipkeys, ensure_ascii=ensure_ascii,
            check_circular=check_circular, allow_nan=allow_nan, indent=indent,
            separators=separators,
            default=default, sort_keys=sort_keys, **kw).iterencode(obj)
    # could accelerate with writelines in some versions of Python, at
    # a debuggability cost
    for chunk in iterable:
        fp.write(chunk)

dump() converts obj, its first argument, to JSON and writes the result to fp, its second argument (a file object, which is synonymous with file-like object and is also called a stream).

The table below shows which JSON type each Python object is converted to.

Python	JSON
dict	object
list, tuple	array
str	string
int, float, int- & float-derived Enums	number
True	true
False	false
None	null
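
To make the table concrete, here is a small sketch that serializes one value for each row. The key names in `sample` are arbitrary labels chosen for this example, not anything defined by the json module.

```python
import json

# One value per row of the conversion table (key names are arbitrary).
sample = {
    "dict": {"a": 1},
    "list": [1, 2],
    "tuple": (1, 2),
    "str": "text",
    "int": 1,
    "float": 1.5,
    "true": True,
    "false": False,
    "none": None,
}

encoded = json.dumps(sample)
print(encoded)

# JSON has no tuple type: a tuple is written as an array and read back as a list.
decoded = json.loads(encoded)
print(type(decoded["tuple"]).__name__)
```

Note that the conversion is not symmetric: the tuple does not survive a round trip.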

The json module always produces str objects, never bytes, so fp.write() must accept str input.
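
A quick way to see this requirement is to compare a text stream with a binary one. The sketch below uses `io.StringIO` and `io.BytesIO` as stand-ins for files opened in text and binary mode.

```python
import io
import json

# A text stream accepts the str chunks that dump() writes.
text_stream = io.StringIO()
json.dump({"a": 1}, text_stream)
text_result = text_stream.getvalue()

# A binary stream only accepts bytes-like objects, so fp.write(chunk) fails.
binary_stream = io.BytesIO()
try:
    json.dump({"a": 1}, binary_stream)
    binary_error = None
except TypeError as exc:
    binary_error = exc

print(text_result)
print(type(binary_error).__name__)
```

In practice this means opening the target file in text mode ("w") rather than binary mode ("wb").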

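In the first if/else, when no parameter other than obj and fp has been changed from its default, the module-level _default_encoder is used: its iterencode() turns obj, the first parameter, into an iterable of str chunks, which is bound to the name iterable. If any other parameter is supplied, execution goes to the else branch, where a new encoder is built from those parameters. If cls is not given there, JSONEncoder is used.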

Finally, the for loop writes each chunk of iterable to fp, the second parameter.
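
The flow described above can be reproduced from the outside with the public JSONEncoder class. This is only a sketch: it builds a fresh `JSONEncoder()` with default arguments instead of touching the private `_default_encoder`, on the assumption that the two are configured the same way.

```python
import io
import json

data = {"name": "json", "tags": ["a", "b"]}

# iterencode() yields the JSON document as one or more str chunks.
chunks = list(json.JSONEncoder().iterencode(data))

# dump() with only obj and fp writes the same text to fp.
buffer = io.StringIO()
json.dump(data, buffer)
written = buffer.getvalue()

# Passing any option (here indent and sort_keys) takes the else branch
# and builds a new encoder with those settings.
pretty_buffer = io.StringIO()
json.dump(data, pretty_buffer, indent=2, sort_keys=True)
pretty = pretty_buffer.getvalue()

print("".join(chunks) == written)
print(pretty)
```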

def dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True,
        allow_nan=True, cls=None, indent=None, separators=None,
        default=None, sort_keys=False, **kw):
    # cached encoder
    if (not skipkeys and ensure_ascii and
        check_circular and allow_nan and
        cls is None and indent is None and separators is None and
        default is None and not sort_keys and not kw):
        return _default_encoder.encode(obj)
    if cls is None:
        cls = JSONEncoder
    return cls(
        skipkeys=skipkeys, ensure_ascii=ensure_ascii,
        check_circular=check_circular, allow_nan=allow_nan, indent=indent,
        separators=separators, default=default, sort_keys=sort_keys,
        **kw).encode(obj)
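
dumps() has the same shape as dump(): the same check on the default arguments, then either the cached encoder or a newly built one. The difference is that it calls encode() and returns the resulting str instead of writing chunks to a file. A minimal sketch of both paths, with example data made up for this post:

```python
import json

data = {"b": 1, "a": "한글"}

# All defaults: handled by the cached encoder. Non-ASCII characters are escaped.
default_text = json.dumps(data)

# Any non-default option builds a new JSONEncoder with those settings.
custom_text = json.dumps(
    data, ensure_ascii=False, sort_keys=True, separators=(",", ":")
)

print(default_text)
print(custom_text)
```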

def detect_encoding(b):
    bstartswith = b.startswith
    if bstartswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return 'utf-32'
    if bstartswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf-16'
    if bstartswith(codecs.BOM_UTF8):
        return 'utf-8-sig'

    if len(b) >= 4:
        if not b[0]:
            # 00 00 -- -- - utf-32-be
            # 00 XX -- -- - utf-16-be
            return 'utf-16-be' if b[1] else 'utf-32-be'
        if not b[1]:
            # XX 00 00 00 - utf-32-le
            # XX 00 00 XX - utf-16-le
            # XX 00 XX -- - utf-16-le
            return 'utf-16-le' if b[2] or b[3] else 'utf-32-le'
    elif len(b) == 2:
        if not b[0]:
            # 00 XX - utf-16-be
            return 'utf-16-be'
        if not b[1]:
            # XX 00 - utf-16-le
            return 'utf-16-le'
    # default
    return 'utf-8'

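detect_encoding() returns, as a str, the name of the encoding of its parameter b (bytes). It first checks whether the input starts with a BOM, using the constants BOM_UTF32_BE, BOM_UTF32_LE, BOM_UTF16_BE, BOM_UTF16_LE, and BOM_UTF8 from codecs.py. If there is no BOM, it guesses from the positions of the null bytes in the first few bytes, and otherwise falls back to 'utf-8'.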
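
To watch the function at work, the sketch below encodes the same short document with several codecs and asks detect_encoding() about each result. It calls `json.detect_encoding` directly; as far as I can tell this helper is not part of the documented API, so treat this as exploration rather than something to rely on.

```python
import json

text = '{"a": 1}'
codec_names = (
    "utf-8", "utf-8-sig", "utf-16", "utf-32",
    "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be",
)

# Map each codec used for encoding to the name detect_encoding() reports.
detected = {name: json.detect_encoding(text.encode(name)) for name in codec_names}

for name, guess in detected.items():
    print(f"{name:10} -> {guess}")
```

The codecs that write a BOM ("utf-8-sig", "utf-16", "utf-32") are caught by the startswith checks, while the others are recognized from their null-byte pattern.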

Let's take a look at those constants in codecs.py.

# Byte Order Mark (BOM = ZERO WIDTH NO-BREAK SPACE = U+FEFF)
# and its possible byte string values
# for UTF8/UTF16/UTF32 output and little/big endian machines

# UTF-8
BOM_UTF8 = b'\xef\xbb\xbf'

# UTF-16, little endian
BOM_LE = BOM_UTF16_LE = b'\xff\xfe'

# UTF-16, big endian
BOM_BE = BOM_UTF16_BE = b'\xfe\xff'

# UTF-32, little endian
BOM_UTF32_LE = b'\xff\xfe\x00\x00'

# UTF-32, big endian
BOM_UTF32_BE = b'\x00\x00\xfe\xff'

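Before going further, notice the term BOM (byte order mark) in the first three comment lines. A BOM is a short byte sequence that declares the encoding of a text file, which also means the encoding can be identified from it. It is the Unicode character U+FEFF, written in the file's own encoding. Using a BOM is optional, but when one is used it must appear at the very start of the text stream. A BOM carries information such as the endianness and the encoding (UTF-8, UTF-16, and so on).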

Byte order marks by encoding (from Wikipedia)
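
Since every BOM is just U+FEFF in a particular encoding, the constants can be checked against `str.encode()`. The sketch below does that, and also shows one reason the order of the checks in detect_encoding() matters.

```python
import codecs

bom = "\ufeff"

# Each constant equals U+FEFF encoded with the matching codec.
matches = {
    "utf-8": bom.encode("utf-8") == codecs.BOM_UTF8,
    "utf-16-le": bom.encode("utf-16-le") == codecs.BOM_UTF16_LE,
    "utf-16-be": bom.encode("utf-16-be") == codecs.BOM_UTF16_BE,
    "utf-32-le": bom.encode("utf-32-le") == codecs.BOM_UTF32_LE,
    "utf-32-be": bom.encode("utf-32-be") == codecs.BOM_UTF32_BE,
}
print(matches)

# The UTF-32 LE BOM begins with the UTF-16 LE BOM, so a UTF-16 check done
# first would also match UTF-32 LE input.
overlap = codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)
print(overlap)
```

This overlap is consistent with detect_encoding() testing the UTF-32 BOMs before the UTF-16 ones.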

def load(fp, *, cls=None, object_hook=None, parse_float=None,
         parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    return loads(fp.read(),
                 cls=cls, object_hook=object_hook,
                 parse_float=parse_float, parse_int=parse_int,
                 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

load() does the opposite of dump(): it deserializes the contents of fp (a file-like object) into a Python object. load() has no real logic of its own. It reads everything from fp with fp.read() and passes the result to loads(), shown below.
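
Because load() is only `loads(fp.read(), ...)`, the two calls below should give the same result. The sketch also feeds load() a binary stream, which works because loads() accepts bytes. The sample document is made up for this example.

```python
import io
import json

document = '{"a": [1, 2.5, null]}'

# load() on a text stream versus calling loads() on its contents by hand.
from_load = json.load(io.StringIO(document))
from_loads = json.loads(io.StringIO(document).read())

# Unlike dump(), load() also accepts a binary stream: fp.read() returns bytes,
# and loads() detects the encoding before decoding.
from_binary = json.load(io.BytesIO(document.encode("utf-16")))

print(from_load)
print(from_load == from_loads == from_binary)
```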

def loads(s, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    if isinstance(s, str):
        if s.startswith('\ufeff'):
            raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
                                  s, 0)
    else:
        if not isinstance(s, (bytes, bytearray)):
            raise TypeError(f'the JSON object must be str, bytes or bytearray, '
                            f'not {s.__class__.__name__}')
        s = s.decode(detect_encoding(s), 'surrogatepass')

    if "encoding" in kw:
        import warnings
        warnings.warn(
            "'encoding' is ignored and deprecated. It will be removed in Python 3.9",
            DeprecationWarning,
            stacklevel=2
        )
        del kw['encoding']

    if (cls is None and object_hook is None and
            parse_int is None and parse_float is None and
            parse_constant is None and object_pairs_hook is None and not kw):
        return _default_decoder.decode(s)
    if cls is None:
        cls = JSONDecoder
    if object_hook is not None:
        kw['object_hook'] = object_hook
    if object_pairs_hook is not None:
        kw['object_pairs_hook'] = object_pairs_hook
    if parse_float is not None:
        kw['parse_float'] = parse_float
    if parse_int is not None:
        kw['parse_int'] = parse_int
    if parse_constant is not None:
        kw['parse_constant'] = parse_constant
    return cls(**kw).decode(s)

Let's start with the parameters of loads(). The first parameter, s, is a str, bytes, or bytearray that contains a JSON document.

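The function begins by checking the type of s. If s is a str, it must not start with U+FEFF, otherwise a JSONDecodeError is raised. If s is not one of str, bytes, or bytearray, a TypeError is raised. bytes and bytearray input is decoded to str using the encoding reported by detect_encoding().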

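encoding is a deprecated parameter, so when it is passed a warning is issued and it is removed from the keyword arguments. If cls is given, the supplied decoder is used for deserialization instead of the default JSONDecoder. loads(), too, mostly just calls the decode() method of a JSONDecoder with the parameters it received, so there is little logic here beyond the type check.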
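
The branches described above can each be triggered with a one-line call. This is a sketch with made-up inputs; `Decimal` is used only as an example of a parse_float callable.

```python
import json
from decimal import Decimal

# bytes input with a UTF-8 BOM: decoded as 'utf-8-sig', so the BOM is dropped.
from_bytes = json.loads(b'\xef\xbb\xbf{"a": 1}')

# str input that starts with U+FEFF is rejected.
try:
    json.loads('\ufeff{"a": 1}')
    bom_error = None
except json.JSONDecodeError as exc:
    bom_error = exc

# Anything other than str, bytes, or bytearray raises TypeError.
try:
    json.loads(123)
    type_error = None
except TypeError as exc:
    type_error = exc

# Passing an option skips the cached decoder and builds a new JSONDecoder.
price = json.loads('{"price": 1.10}', parse_float=Decimal)

print(from_bytes, bom_error, type_error, price, sep="\n")
```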

That wraps up the walkthrough of __init__.py. The file is 370 lines long, but most of it consists of comments and examples, so there is not much actual code and it was quick to read.

Next, we will look at encoder.py.