🔧 LPMM Knowledge Base Import File Requirements

OpenIE File Naming Requirements

After extraction completes, the OpenIE file is named in the form month-day-hour-minute-openie.json and stored in the data/openie directory.

You may then rename the file, for example to add a description of its content.

However, the -openie.json suffix at the end of the file name must be kept.

Examples:

Valid file names:

  • 千恋万花剧情-openie.json
  • 明日方舟全剧情-openie.json
  • 网络热梗(截止到2023年10月)-openie.json

Invalid file names:

  • 贴吧热梗-openie.txt (not a .json file)
  • 明日方舟全剧情.json (missing the -openie identifier)
  • 114514-openie.json (description does not say what the content is)
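
The suffix rule can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named is_valid_openie_filename that is not part of LPMM. It only tests for the -openie.json suffix; it cannot judge whether the description in front of the suffix is precise, so a name like 114514-openie.json still passes this check.

```python
from pathlib import PurePath

# Suffix that every import file name must keep, per the rule above.
REQUIRED_SUFFIX = "-openie.json"


def is_valid_openie_filename(path: str) -> bool:
    """Hypothetical helper: True if the file name ends with -openie.json."""
    name = PurePath(path).name  # drop any directory part such as data/openie/
    return name.endswith(REQUIRED_SUFFIX)


if __name__ == "__main__":
    for candidate in [
        "data/openie/千恋万花剧情-openie.json",
        "贴吧热梗-openie.txt",
        "明日方舟全剧情.json",
    ]:
        print(candidate, is_valid_openie_filename(candidate))
```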

I. OpenIE Technology Overview

Open Information Extraction (OpenIE) is an open-domain information extraction technique that automatically extracts structured relation triples (subject, predicate, object) from unstructured text, without requiring relation types to be defined in advance. Its core characteristics are:

  • Unsupervised: does not rely on predefined domain ontologies or relation libraries.
  • Flexible: handles varied phrasings of the same fact (e.g., "Apple was founded by Steve Jobs" and "Steve Jobs is Apple's founder").
  • Redundancy-tolerant: the same relation may be extracted more than once when it is expressed in different ways.

II. OpenIE Data Format Specification

1. Overall Structure

json
{
    "docs": [
        {
            "idx": "Unique document identifier (usually the SHA256 hash of the text)",
            "passage": "Original text of the document",
            "extracted_entities": ["entity1", "entity2", ...],
            "extracted_triples": [["subject", "predicate", "object"], ...]
        },
        ...
    ],
    "avg_ent_chars": "Average number of characters per entity",
    "avg_ent_words": "Average number of words per entity"
}

2. Field Explanation

| Field | Type | Description |
| --- | --- | --- |
| docs | Array | Extraction results for all documents |
| idx | String | Unique document identifier (usually the SHA256 hash of the text, to ensure uniqueness) |
| passage | String | Original text content |
| extracted_entities | String array | All entities identified in the text (deduplicated) |
| extracted_triples | Triple array | Extracted structured relations; each triple has the form ["subject", "predicate", "object"] |
| avg_ent_chars | Numeric | Average entity length in characters (computed over all extracted_entities) |
| avg_ent_words | Numeric | Average entity length in words (words are split on spaces) |
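
To make the field definitions concrete, here is a minimal sketch of assembling data in this format by hand. The function names make_doc and build_openie are hypothetical, the hash is assumed to be taken over the UTF-8 bytes of the passage (the specification only says SHA256 is "usually" used), and the averages are assumed to be taken over every entity in every document. The actual LPMM extractor may differ in details such as rounding.

```python
import hashlib
import json


def make_doc(passage, entities, triples):
    """Build one entry of "docs" (hypothetical helper, not LPMM's API)."""
    return {
        # Assumption: idx is the SHA256 hex digest of the UTF-8 passage text.
        "idx": hashlib.sha256(passage.encode("utf-8")).hexdigest(),
        "passage": passage,
        # dict.fromkeys removes duplicates while keeping first-seen order.
        "extracted_entities": list(dict.fromkeys(entities)),
        "extracted_triples": [list(t) for t in triples],
    }


def build_openie(docs):
    """Wrap documents and compute the two averages from the field table."""
    entities = [e for d in docs for e in d["extracted_entities"]]
    count = len(entities) or 1  # avoid dividing by zero when there are no entities
    return {
        "docs": docs,
        "avg_ent_chars": sum(len(e) for e in entities) / count,
        # Words are counted by splitting each entity on whitespace.
        "avg_ent_words": sum(len(e.split()) for e in entities) / count,
    }


doc = make_doc(
    "Steve Jobs is Apple's founder.",
    ["Steve Jobs", "Apple", "Apple"],
    [("Steve Jobs", "is founder of", "Apple")],
)
data = build_openie([doc])
# ensure_ascii=False keeps non-ASCII text (e.g. Chinese passages) readable.
print(json.dumps(data, ensure_ascii=False, indent=4))
```

Note that whitespace splitting treats an entity written without spaces, such as most Chinese entities, as a single word.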

3. Example

json
{
    "docs": [
        {
            "idx": "a1b2c3...",
            "passage": "Steve Jobs founded Apple in 1976.",
            "extracted_entities": ["Steve Jobs", "Apple", "1976"],
            "extracted_triples": [
                ["Steve Jobs", "founded", "Apple"],
                ["Apple", "founded in", "1976"]
            ]
        }
    ],
    "avg_ent_chars": 6.33,
    "avg_ent_words": 1.33
}
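
Before importing, a file can be checked against the structure described in this section. This is a minimal sketch, assuming a hypothetical function named validate_openie; the checks mirror the field definitions above and are not LPMM's own import validation.

```python
def validate_openie(data) -> list:
    """Return a list of problems found; an empty list means the structure looks right."""
    if not isinstance(data, dict):
        return ["top level must be an object"]
    docs = data.get("docs")
    if not isinstance(docs, list):
        return ['"docs" must be an array']

    problems = []
    for i, doc in enumerate(docs):
        if not isinstance(doc, dict):
            problems.append(f"docs[{i}] must be an object")
            continue
        for key in ("idx", "passage"):
            if not isinstance(doc.get(key), str):
                problems.append(f"docs[{i}].{key} must be a string")
        entities = doc.get("extracted_entities")
        if not (isinstance(entities, list)
                and all(isinstance(e, str) for e in entities)):
            problems.append(f"docs[{i}].extracted_entities must be an array of strings")
        triples = doc.get("extracted_triples")
        triples_ok = isinstance(triples, list) and all(
            isinstance(t, list) and len(t) == 3
            and all(isinstance(part, str) for part in t)
            for t in triples
        )
        if not triples_ok:
            problems.append(f"docs[{i}].extracted_triples must be an array of 3-string arrays")

    for key in ("avg_ent_chars", "avg_ent_words"):
        value = data.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key} must be a number")
    return problems
```

To use it, load a file with json.load and pass the resulting object to validate_openie; any returned messages point at the fields that need fixing.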