public final class LanguageDetectorImpl extends java.lang.Object implements LanguageDetector
This class is immutable and thus thread-safe.
| Modifier and Type | Field and Description |
|---|---|
| private double | alpha |
| private static double | ALPHA_WIDTH: TODO document what this is for, and why that value is chosen. |
| private static int | BASE_FREQ: TODO document what this is for, and why that value is chosen. |
| private static double | CONV_THRESHOLD: TODO document what this is for, and why that value is chosen. |
| private static long | DEFAULT_SEED: This is used when no custom seed was passed in. |
| private static int | ITERATION_LIMIT: TODO document what this is for, and why that value is chosen. |
| private static org.slf4j.Logger | logger |
| private double | minimalConfidence |
| private static int | N_TRIAL: TODO document what this is for, and why that value is chosen. |
| private NgramExtractor | ngramExtractor |
| private @NotNull NgramFrequencyData | ngramFrequencyData |
| private double | prefixFactor |
| private @Nullable double[] | priorMap: User-defined language priorities, in the same order as langlist. |
| private double | probabilityThreshold |
| private com.google.common.base.Optional<java.lang.Long> | seed |
| private int | shortTextAlgorithm |
| private double | suffixFactor |
| Constructor and Description |
|---|
| LanguageDetectorImpl(@NotNull NgramFrequencyData ngramFrequencyData, double alpha, com.google.common.base.Optional<java.lang.Long> seed, int shortTextAlgorithm, double prefixFactor, double suffixFactor, double probabilityThreshold, double minimalConfidence, @Nullable java.util.Map<LdLocale,java.lang.Double> langWeightingMap, @NotNull NgramExtractor ngramExtractor): Use the LanguageDetectorBuilder. |
| Modifier and Type | Method and Description |
|---|---|
| com.google.common.base.Optional<LdLocale> | detect(java.lang.CharSequence text) |
| private @Nullable double[] | detectBlock(java.lang.CharSequence text) |
| private double[] | detectBlockLongText(java.util.List<java.lang.String> ngrams): This is the original algorithm, used for all text lengths. |
| private double[] | detectBlockShortText(java.util.Map<java.lang.String,java.lang.Integer> ngrams) |
| java.util.List<DetectedLanguage> | getProbabilities(java.lang.CharSequence text): Returns all languages with at least some likelihood. |
| private double[] | initProbability(): Initializes the map of language probabilities. |
| private @NotNull java.util.List<DetectedLanguage> | sortProbability(double[] prob): Returns the detected languages sorted by probability, descending. |
| private boolean | updateLangProb(@NotNull double[] prob, @NotNull java.lang.String ngram, int count, double alpha): Updates the language probabilities with an n-gram string (N = 1, 2, 3). |
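The method signatures above take n-grams in two shapes: a plain list (detectBlockLongText) and a map from n-gram to occurrence count (detectBlockShortText, and the count parameter of updateLangProb). The following is a minimal sketch of how the two shapes relate, for n-grams of length 1 to 3. The class NgramSketch and both helper methods are hypothetical names introduced for illustration; the real NgramExtractor may filter or normalize text in ways not shown on this page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the library's NgramExtractor.
final class NgramSketch {

    private NgramSketch() {}

    /** Lists every character n-gram of length 1 to 3, in text order. */
    static List<String> extractGrams(CharSequence text) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            // Emit the 1-, 2- and 3-gram starting here, as far as the text allows.
            for (int n = 1; n <= 3 && start + n <= text.length(); n++) {
                grams.add(text.subSequence(start, start + n).toString());
            }
        }
        return grams;
    }

    /** Collapses the list into a map of n-gram to occurrence count. */
    static Map<String, Integer> countGrams(List<String> grams) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String gram : grams) {
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> grams = extractGrams("abab");
        System.out.println(grams);             // list form, duplicates kept
        System.out.println(countGrams(grams)); // counted form, one entry per distinct gram
    }
}
```

The counted form carries the same information as the list apart from ordering, which is why a per-gram count can stand in for repeated list entries.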
private static final org.slf4j.Logger logger
private static final double ALPHA_WIDTH
private static final int ITERATION_LIMIT
private static final double CONV_THRESHOLD
private static final int BASE_FREQ
private static final int N_TRIAL
private static final long DEFAULT_SEED
@NotNull private final NgramFrequencyData ngramFrequencyData
@Nullable private final double[] priorMap
User-defined language priorities, in the same order as langlist.
private final double alpha
private final com.google.common.base.Optional<java.lang.Long> seed
private final int shortTextAlgorithm
private final double prefixFactor
private final double suffixFactor
private final double probabilityThreshold
private final double minimalConfidence
private final NgramExtractor ngramExtractor
LanguageDetectorImpl(@NotNull NgramFrequencyData ngramFrequencyData,
double alpha,
com.google.common.base.Optional<java.lang.Long> seed,
int shortTextAlgorithm,
double prefixFactor,
double suffixFactor,
double probabilityThreshold,
double minimalConfidence,
@Nullable java.util.Map<LdLocale,java.lang.Double> langWeightingMap,
@NotNull NgramExtractor ngramExtractor)
Use the LanguageDetectorBuilder.
public com.google.common.base.Optional<LdLocale> detect(java.lang.CharSequence text)
Specified by: detect in interface LanguageDetector
Parameters: text - You probably want a TextObject.
public java.util.List<DetectedLanguage> getProbabilities(java.lang.CharSequence text)
Description copied from interface: LanguageDetector
Returns all languages with at least some likelihood. There is a configurable cutoff applied for languages with very low probability.
As the algorithm currently works, this method may return, for example, 0.99 for Danish and less than 0.01 for Norwegian even though the two languages are almost equally likely. It would be nice if this could be improved in future versions.
Specified by: getProbabilities in interface LanguageDetector
Parameters: text - You probably want a TextObject.
@Nullable private double[] detectBlock(java.lang.CharSequence text)
private double[] detectBlockShortText(java.util.Map<java.lang.String,java.lang.Integer> ngrams)
private double[] detectBlockLongText(java.util.List<java.lang.String> ngrams)
private double[] initProbability()
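Since priorMap is documented as holding user-defined language priorities in the same order as langlist, one plausible way to initialize the probability array is to copy those priors when they exist. The sketch below does that and falls back to a uniform distribution when priorMap is null. The uniform fallback, the class name InitProbabilitySketch, and the languageCount parameter are assumptions made for illustration; this page does not state how the real method behaves.

```java
import java.util.Arrays;

// Hypothetical illustration only; the real initProbability() takes no arguments.
final class InitProbabilitySketch {

    private InitProbabilitySketch() {}

    /**
     * Builds the starting probability array, one slot per language.
     * Assumption: priors are copied if present, otherwise every language
     * starts with the same probability 1 / languageCount.
     */
    static double[] initProbability(int languageCount, double[] priorMap) {
        double[] prob = new double[languageCount];
        if (priorMap != null) {
            // Copy so that later updates never modify the shared priors.
            System.arraycopy(priorMap, 0, prob, 0, languageCount);
        } else {
            Arrays.fill(prob, 1.0 / languageCount);
        }
        return prob;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(initProbability(4, null)));
        System.out.println(Arrays.toString(initProbability(2, new double[] {0.7, 0.3})));
    }
}
```

Copying the priors rather than returning the field itself keeps the class immutable, which matches the thread-safety statement at the top of this page.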
private boolean updateLangProb(@NotNull double[] prob,
@NotNull java.lang.String ngram,
int count,
double alpha)
Parameters: count - 1-n: how often the gram occurred.
@NotNull private java.util.List<DetectedLanguage> sortProbability(double[] prob)
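The cutoff described for getProbabilities and the descending order produced by sortProbability can be sketched together as follows. This is a minimal illustration under stated assumptions: the Scored class is a hypothetical stand-in for DetectedLanguage, the language codes are passed as a parallel array in the same order as the probabilities, and the strict "greater than" comparison against the threshold is a guess rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not the library's sortProbability().
final class ProbabilitySortSketch {

    private ProbabilitySortSketch() {}

    /** Hypothetical stand-in for DetectedLanguage: a language code plus its probability. */
    static final class Scored {
        final String language;
        final double probability;

        Scored(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }

        @Override
        public String toString() {
            return language + ":" + probability;
        }
    }

    /** Drops entries at or below the threshold, then sorts the rest by probability, descending. */
    static List<Scored> sortAboveThreshold(String[] languages, double[] prob, double threshold) {
        List<Scored> result = new ArrayList<>();
        for (int i = 0; i < prob.length; i++) {
            if (prob[i] > threshold) { // assumed comparison; the real cutoff rule may differ
                result.add(new Scored(languages[i], prob[i]));
            }
        }
        result.sort(Comparator.comparingDouble((Scored s) -> s.probability).reversed());
        return result;
    }

    public static void main(String[] args) {
        String[] languages = {"da", "no", "sv"};
        double[] prob = {0.2, 0.75, 0.05};
        System.out.println(sortAboveThreshold(languages, prob, 0.1));
    }
}
```

In this usage example the entry for "sv" falls below the threshold and is dropped, and "no" is listed before "da".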