xlingualNER API Tutorial

This tutorial walks through how to use the xlingualNER API.

Introduction

The purpose of the xlingualNER system is to tag sentences in various languages with our fine-grained NER labels. The result of this system is a file with the word on the left and the tag on the right, space-separated. In the future, we hope this system will support more languages and achieve higher accuracy.
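As a rough illustration of that word/tag layout, the sketch below turns a tags list into space-separated lines. It assumes the list has the shape shown in the Example section of this tutorial (one single-key object per token). The function name to_word_tag_lines is hypothetical and not part of the API.

```python
def to_word_tag_lines(tags):
    """Render token/tag pairs as 'word tag' lines, one token per line.

    `tags` is assumed to be a list of single-key dicts such as
    {"World": "I-EVENT"}, as in the API response example.
    """
    lines = []
    for item in tags:
        for token, tag in item.items():
            lines.append(f"{token} {tag}")
    return "\n".join(lines)


sample = [{"In": "O"}, {"the": "B-EVENT"}, {"World": "I-EVENT"}, {"War": "I-EVENT"}]
print(to_word_tag_lines(sample))
```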

The current endpoint for the xlingualNER is https://seaword.seasalt.ai/ner/.

See the technical documentation and try out the API in the Swagger docs: <https://seaword.seasalt.ai/ner/docs>.

Current models for xlingualNER can be found in the Azure fileshare at /mnt/models/nlp/xlingual-ner/. model_final is our final model, and model_baseline is trained on all of the English and Chinese OntoNotes data.

Implementation

The xlingualNER pipeline consists of three parts: an NER tagger, a datetime grounding component, and a miscellaneous extractor (IP address, email address, URL). The input is a sentence in any supported language; an example output is listed further below. Our model uses NLTK’s word tokenizer by default and Jieba for Chinese (ZH).
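The tokenizer choice described above might be sketched as follows. This is only an assumption about how the dispatch could look, not the pipeline's actual code: the function names are hypothetical, matching on a zh prefix is a guess, and both nltk and jieba are third-party packages that must be installed separately (NLTK also needs its tokenizer data downloaded).

```python
def tokenizer_name(lang_code):
    """Return which tokenizer to use for a code such as 'zh-TW' or 'en-US'."""
    # Assumption: every Chinese variant is identified by a "zh" prefix.
    return "jieba" if lang_code.lower().startswith("zh") else "nltk"


def tokenize(text, lang_code):
    """Split text into tokens with Jieba for Chinese, NLTK otherwise."""
    if tokenizer_name(lang_code) == "jieba":
        import jieba  # third-party; imported lazily so it is only needed for ZH
        return jieba.lcut(text)
    from nltk.tokenize import word_tokenize  # third-party; needs NLTK tokenizer data
    return word_tokenize(text)
```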

The labels are as follows (copied from annotation instructions document):

PERSON

Identifies the first names, last names, family names, unique nicknames of people and fictional characters. Generational markers (Jr., IV) are included in the extent, while personal and occupational titles such as Ms., Dr., President, Secretary are NOT included.

  • Example: Dr. [TITLE] Bob Smith, Sr. [PERSON] is Chinese-American [NATIONALITY]

  • Example: Mitt Romney[PERSON] is a politician

NORP

Adjectival forms of named religions (e.g. Christian, Muslim, Jewish), heritage, tribes, and political groups (e.g. Democratic, Republicans)

  • Example: Ramadhan[EVENT] is a holy month for Muslim[NORP] communities

  • Example: I love South American[NORP] cuisines

FAC

Man-made infrastructures such as streets, bridges, highways, buildings, monuments, airports, parks.

  • Example: When I go to New York City[CITY], I shop on 5th Avenue[FAC].

  • Example: I will be flying from YVR Airport [FAC]

  • Example: I drove through the I-95 [FAC] to Maine[STATE_OR_PROVINCE]

  • Example: I went to Baxter State Park[FAC]

ORG

Describes government agencies, political agencies, educational institutions, sport teams, and musical groups. Names of hospitals, libraries, and museums are marked as ORG except when those are referred to in a locative way (expressing location).

  • Example: United States Congress[ORG] had a meeting

  • Example: The United Nations [ORG] meeting took place this weekend [DATE]

  • Example: the Supreme Court Justice[ORG]

  • Example: the White House[ORG]

ORG_COMP

Describes names of companies

  • Example: They recently opened the first[ORDINAL] Paul Bakery[ORG_COMP] in Canada[COUNTRY]

TITLE

Includes job and position titles such as Professor, Doctor, as well as royal family titles

  • Example: The Queen[TITLE] of England[COUNTRY] loves ice cream

  • Example: I am a queen who loves ice cream → no named-entity

RELIGION

Religious beliefs and concepts like Judaism, Hinduism, Buddhism, Islam. However, it does not include religious groups.

  • Example: The spread of Islam[RELIGION] in Indonesia[COUNTRY] did not come directly from Arabic[NATIONALITY] country.

  • Example: Christianity[RELIGION] is a religion, while Christians[NORP] are religious groups

LOC

Named places such as mountains, rivers, regions, continents, and street names. IMPORTANT NOTE about LOC vs. FAC labels: LOC entities are named entities that indicate a location but not a specific public man-made facility. LOC is more general and is used more often.

  • Example: Montenegro[COUNTRY] is located in Eastern Europe[LOC].

  • Example: East of France[LOC]

  • Example: South Boston[LOC]

  • Example: I will meet you at Chinatown[LOC]

CITY

City names, towns

  • Example: The first Starbucks[ORG_COMP] store was opened in Seattle[CITY].

  • Example: Company based in Bloomfield Township[CITY] , Oakland County[CITY] , Michigan[STATE_OR_PROVINCE].

STATE_OR_PROVINCE

Includes names of state or province

  • Example: New Hampshire[STATE_OR_PROVINCE] is the perfect outdoor playground

ZIP_CODE

A Canadian postal code or an American ZIP code

  • Example: V1D6P3 or V1D 6P3 or 79645

COUNTRY

Includes names of countries

  • Example: Bali[CITY] is located in Indonesia[COUNTRY]

NATIONALITY

Members of a state, country, or city

  • Example: Many Canadians[NATIONALITY] want to become Parisians[NATIONALITY].

MISC

Anything that would be capitalized in English, even if it’s not capitalized in your language (e.g. Grammy Awards).

  • Example: They are using the latest Wi-Fi[MISC] technology.

PRODUCT

Named products by companies

  • Example: I want to buy the latest Sony Camera[PRODUCT] by Sony[ORG_COMP]

  • Example: I got a new iPod[PRODUCT] for my birthday

  • Example: I do not want to buy cameras produced by Fuji[ORG_COMP]

EVENT

Any popular/repeating events

  • Example: He attended The Presidential Election[EVENT]

WORK_OF_ART

Name of a play, movie, song, painting, or book

  • Example: He stole The Mona Lisa[WORK_OF_ART] last week from Louvre Museum[FAC].

LAW

Legal entity names

  • Example: They passed the American Constitution[LAW] in 1867[DATE]

LANGUAGE

Any language name

  • Example: I speak English[LANGUAGE], Bahasa Indonesia[LANGUAGE], and French[LANGUAGE]

DATE

Includes day, month, or year. Note that indirect references, such as “today”, “yesterday”, and “3 weeks from now” should also be tagged.

  • Example: He was born on January 3, 2002[DATE].

  • Example: On Monday[DATE], he was stuck in traffic for 2 hours[DURATION]

  • Example: Last week[DATE], I worked for 8 hours[DURATION]

  • Example: I love the 1940s[DATE] movies

  • Example: In the fall of 2008, I moved to Australia [COUNTRY]

TIME

Specific hour / minute / seconds.

  • Example: I have a meeting today[DATE] at 9:30am[TIME]

  • Negative example: On Wednesday[DATE], he was stuck in traffic for 2 hours[DURATION]

DURATION

General time span

  • Example: I have a piano lesson every Wednesday[SET] from 2pm to 3pm[DURATION]

SET

Specific repeating time span

  • Example: Every Tuesday[SET] he plays tennis

PERCENT

Anything with percentage

  • Example: He made around 75 percent[PERCENT] in profit.

MONEY

Related to money and currencies

  • Example: He borrowed 17,000 British pounds[MONEY]

  • Another example: The Indonesian rupiah[MONEY] depreciated 50%[PERCENT] against the U.S. dollar[MONEY]

CARDINAL

Number that doesn’t have a measurement

  • Example: Scrabble[PRODUCT] is a game played by 2[CARDINAL] , 3[CARDINAL] or 4[CARDINAL] people

  • Example: about half[CARDINAL] of the class received good marks

  • Example: four[CARDINAL] of my family members love skiing

  • Negative example: He bought one kg[QUANTITY] of oranges

QUANTITY

Number and must have units of measurement (as of distance, weight)

  • Example: Sharks grow up to 20 feet [QUANTITY]

  • Example: Twenty[CARDINAL] 20 feet[QUANTITY] sharks

  • Example: I walked 4 miles [QUANTITY] to my home and carried one kg[QUANTITY] of oranges

ORDINAL

Numbers that contain some sort of order

  • Example: He was first[ORDINAL] to arrive.

  • Another example: He was the millionth[ORDINAL] person to win.

  • Another example: Kamala Harris[PERSON] is the first[ORDINAL] African American[NATIONALITY] to hold the office of Attorney General[TITLE] in the state’s history.

URL

Website url

  • Example: For more information, please visit our website at JD.com [URL]

CRIMINAL_CHARGE

Legal term

  • Example: He was charged with manslaughter[CRIMINAL_CHARGE] last week[DATE]

The following 3 tags were confusing, so we removed them from our Label Studio annotations. They are no longer in the gold data; they appear only in the silver data that we trained on.

  • CAUSE_OF_DEATH

  • HANDLE

  • IDEOLOGY

Example

The following English sentence is submitted to the system:

{
    "text": "In the World War(1914–1918) the First Çanakkale was a battle area. Millions died in this war."
}

The system output has three key elements, namely a list of token-level named entity tags, datetime grounding, and miscellaneous information such as Email, URL and IP-address:

{
    "tags": [
        {
            "In": "O"
        },
        {
            "the": "B-EVENT"
        },
        {
            "World": "I-EVENT"
        },
        {
            "War": "I-EVENT"
        },
        {
            "(": "O"
        },
        {
            "1914–1918": "B-DATE"
        },
        {
            ")": "O"
        },
        {
            "the": "B-LOC"
        },
        {
            "First": "I-LOC"
        },
        {
            "Çanakkale": "I-LOC"
        },
        {
            "was": "O"
        },
        {
            "a": "O"
        },
        {
            "battle": "O"
        },
        {
            "area": "O"
        },
        {
            ".": "O"
        },
        {
            "Millions": "B-CARDINAL"
        },
        {
            "died": "O"
        },
        {
            "in": "O"
        },
        {
            "this": "O"
        },
        {
            "war": "O"
        },
        {
            ".": "O"
        }
    ],
    "datetime": [
        {
            "start": 17,
            "end": 20,
            "resolution": {
                "values": [
                    {
                        "timex": "1914",
                        "type": "daterange",
                        "start": "1914-01-01",
                        "end": "1915-01-01"
                    }
                ]
            },
            "text": "1914",
            "type_name": "datetimeV2.daterange"
        },
        {
            "start": 22,
            "end": 25,
            "resolution": {
                "values": [
                    {
                        "timex": "1918",
                        "type": "daterange",
                        "start": "1918-01-01",
                        "end": "1919-01-01"
                    }
                ]
            },
            "text": "1918",
            "type_name": "datetimeV2.daterange"
        }
    ],
    "misc": {
        "Email": [],
        "URL": [],
        "IP-address": []
    }
}

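The final tagging result from the system is the tags field, which labels each token using the BIO scheme (B- marks the first token of an entity, I- a continuation, and O a token outside any entity). The datetime field shows the datetime grounding with start and end dates, along with the corresponding character span of the input sentence. The misc field lists any Email, URL and IP-address values found in the input.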
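To work with this response, it can help to merge the token-level tags into whole entities and to recover the text behind each datetime item. The sketch below assumes the B-, I- and O prefixes seen in the example, and it treats start and end as zero-based, inclusive character offsets, which is inferred from the sample (17 to 20 covers "1914") rather than documented. The function names are hypothetical.

```python
def group_entities(tags):
    """Merge token-level BIO tags into (label, tokens) pairs."""
    entities = []
    current = None  # the entity currently being extended, if any
    for item in tags:
        for token, tag in item.items():
            prefix, _, label = tag.partition("-")
            if prefix == "I" and current is not None and current[0] == label:
                current[1].append(token)  # continue the open entity
            elif prefix in ("B", "I"):
                current = (label, [token])  # start a new entity
                entities.append(current)
            else:
                current = None  # "O" closes any open entity
    return entities


def datetime_texts(text, datetime_items):
    """Slice the input text using inclusive start/end character offsets."""
    return [text[d["start"]:d["end"] + 1] for d in datetime_items]


text = "In the World War(1914–1918) the First Çanakkale was a battle area."
tags = [{"In": "O"}, {"the": "B-EVENT"}, {"World": "I-EVENT"},
        {"War": "I-EVENT"}, {"(": "O"}, {"1914–1918": "B-DATE"}, {")": "O"}]
spans = [{"start": 17, "end": 20}, {"start": 22, "end": 25}]

print(group_entities(tags))
print(datetime_texts(text, spans))
```

Entity tokens are kept as a list rather than joined with spaces, since space-joining would not suit Chinese input.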

Language Support

The Cross-Lingual Named Entity Recognition API currently supports 9 languages. The input language, which follows the convention {lang_code}-{country_code}, is specified in the URL when calling the demo endpoint.

While we intend to add support for more languages in the future, the following language codes are currently supported:

Language               Code
English                en-XX, en-US, en-GB, en-AU, en-SG
Traditional Chinese    zh-TW
Simplified Chinese     zh-CN
Indonesian             id-ID
Javanese               jv-ID
Malay                  ms-MY
Tagalog                tl-PH
Vietnamese             vi-VN
Czech                  cs-CZ
Croatian               hr-HR
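Before calling the API, a client could check the language code against the table above. The sketch below hard-codes the codes listed in this tutorial. The names SUPPORTED_CODES and is_supported are hypothetical, the exact-match comparison is an assumption (the API's case handling is not documented here), and the set would need updating as languages are added.

```python
# Codes copied from the language table in this tutorial.
SUPPORTED_CODES = {
    "en-XX", "en-US", "en-GB", "en-AU", "en-SG",
    "zh-TW", "zh-CN", "id-ID", "jv-ID", "ms-MY",
    "tl-PH", "vi-VN", "cs-CZ", "hr-HR",
}


def is_supported(lang_code):
    """Return True if lang_code is one of the codes listed in this tutorial."""
    return lang_code in SUPPORTED_CODES


print(is_supported("zh-TW"), is_supported("fr-FR"))
```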

API usage

POST /extract

To tag the named entities in a sentence, send a POST request to the /extract endpoint. The language code must also be specified in the URL.

POST https://seaword.seasalt.ai/ner/{lang_code}/extract?access_token={api_key}

The endpoint tags the named entities, calculates datetime grounding, and extracts Email, URL and IP-address values from the input sentence. The request body requires only the following field:

{
    "text": "string"
}

Once the Named Entity tagging has been performed on the full sentence, you will get the following result:

{
    "tags": "List",
    "datetime": "List",
    "misc": "Dict"
}

In this result, the tags field holds the final tagged named entities, datetime is a list of the datetime groundings found in the input sentence, and misc is a dictionary containing the Email, URL and IP-address values extracted from the input sentence.
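Putting the pieces together, a call to /extract might look like the sketch below, which uses only the Python standard library. The URL layout and request body follow this tutorial. The Content-Type header, the function names, and the assumption that the response body is JSON are not confirmed by it, and error handling is omitted.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://seaword.seasalt.ai/ner"


def build_extract_request(text, lang_code, api_key):
    """Build (but do not send) the POST request for the /extract endpoint."""
    query = urllib.parse.urlencode({"access_token": api_key})
    url = f"{BASE_URL}/{lang_code}/extract?{query}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def extract_entities(text, lang_code, api_key, timeout=30.0):
    """Send the request and return the parsed response as a dict."""
    request = build_extract_request(text, lang_code, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a valid key, extract_entities("Millions died in this war.", "en-US", api_key) would be expected to return a dictionary with the tags, datetime and misc keys described above.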