Post on 20-Dec-2015
CSC 599: Computational Scientific Discovery
Lecture 3: Data, Data Structures, and Reasoning in Science
Outline
Computers Helping Scientists
Data in Science
Data Catalogs
Data structures for scientific knowledge
  Equations
  Symbolic knowledge
Simulations and scientific reasoning
  Do simulations
  Architectures for sci. reasoning
  QSIM
Next time: Probabilistic reasoning and graphical models
Computers cooperating with Scientists
Play to the strengths of computers
  Fast
  Accurate (we hope!)
  Don't get bored
  Good for simulations of complex systems!
Avoid their weaknesses
  No “common sense”
  Garbage in, garbage out
  Calculation bugs
  Time to develop/maintain code
    Use someone else's program, if one exists
Don't Just Code-It-Up!
Do you really need a computer?
  Analytical solutions are cooler
  Integrate! (If you can)
Ordinary differential equations
  System of eqns & derivatives w.r.t. one independent variable
    E.g. time
  Laplace and z-transforms
  Perturbation expansions and discrete-time eqns
Partial differential equations
  Another variable can change
    For fluids: pressure, temperature, etc.
  Separation of variables
See Gershenfeld, “The Nature of Mathematical Modeling”
Data in Science
You make a measurement; what should you record?
  Value (numeric or conceptual)
  Precision
    For measured numbers: mean and std deviation
    For concepts: set identity
      “car” vs. “Toyota” vs. “Prius”
  Domain
    System constraints
      Range definition limits
      System extent limits
      Saturation limits
    Instrument constraints
      Range definition limits
      Detection limits
      Reliable limits
    Data constraints
      Observed min/max w/in system extent and detection limits
Data in Science (2)
You make a measurement; what should you record?
  Units and dimensions
  When and where
    Speed of light: expected to be observer-independent
    Count of robins? First bloom of dandelions?
  With what equipment
    E.g. camera model and film type
  By whom?
    Implicit paradigm
    Yesterday's Chinese “guest stars” are today's supernovas
  Why?
    What are you trying to look for?
    Might hint at what your data do not show
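The checklist on the two "Data in Science" slides can be read as the fields of a record type. The following is a minimal sketch, not a standard schema: the class name `Measurement`, all field names, and the example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Measurement:
    """One recorded observation plus the context needed to interpret it."""
    mean: float          # value
    std_dev: float       # precision
    units: str           # units and dimensions
    when: datetime       # when
    where: str           # where
    instrument: str      # with what equipment
    observer: str        # by whom
    purpose: str         # why
    detection_limits: Optional[Tuple[float, float]] = None  # instrument constraint

    def within_detection_limits(self) -> bool:
        """True if the value lies inside the instrument's stated range."""
        if self.detection_limits is None:
            return True  # no limits recorded, so nothing to check
        low, high = self.detection_limits
        return low <= self.mean <= high


# Made-up example values, for illustration only.
reading = Measurement(
    mean=21.4, std_dev=0.3, units="degC",
    when=datetime(2000, 1, 1, 12, 0), where="hypothetical field site",
    instrument="hypothetical thermometer", observer="hypothetical observer",
    purpose="illustration only", detection_limits=(-40.0, 60.0),
)
too_hot = replace(reading, mean=100.0)  # same context, value beyond the limits
```

Making the record immutable (`frozen=True`) is one design choice among several; it keeps the context attached to the value so that neither can be edited separately later.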
Laboratories: Where the Data Comes From
Places where observations are made
  Astronomical observatories
  Particle physics accelerators
    CERN
    Fermilab
  Biology labs
    Human Genome Project
Issues
  Distributed observations
    Human Genome Project
  Collaborative observations
Catalogs
What is (and is not) Recorded
Distributed Nature of Catalogs
Synonyms in Catalogs
Missing and Error Values in Catalogs
Permanence of Catalogs
What is (and is not) recorded
What is “outside of bounds” is not recorded
  Space
    Outside of area of interest
  Time (maybe money!)
    Budget ran out, no more money for sensing
  Magnitude (strength)
“Outside of bounds” may be a “soft” concept
  USGS records earthquakes:
    All Mw > 7
    Most Mw > 6
    Mw ≈ 5 and below when close to seismometers
Q: Is it true that the only small earthquakes are close to seismometers?
Distributed Nature of Catalogs
Distributed databases
  Hopefully designed properly
    Redundancy
    Reliable online up-time
Central repository
  Centralizes data available from other dbs
  Example: some biochemical dbs
  May use inconsistent terms, different missing-value conventions, etc.
  May be out-of-date relative to source dbs
  Format may be “least common denominator” of more specialized formats
Synonyms and Different Meanings for a Given Word
Concepts
  ASCII or Unicode string
Consider the term “robin”
  In N. America: “American robin”, “Turdus migratorius”
  In Europe: “European robin”, “Erithacus rubecula”
What about these terms?
  “thrush family bird”, “songbird”, “bird”
Ontologies can help (if used)
Missing and Error Values
Missing values are often coded as sentinel values
  There might be more than one!
  Esp. for “required” fields
    Clerks were required to get customer telephone #
    Most common telephone number? (111) 111-1111
Error values
  Probably a bigger problem before computers
  Computers can solve some problems:
    Sanity rules
    Parity, checksum (corruption), nonce (security)
  Computers can add problems too
    NASA launched a satellite to look at atmospheric ozone
    Satellite reported very low O3 levels by the south pole
    Data was not believed, BUT WAS TRUE (ozone hole)
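One way to apply both lessons on this slide (sentinels hide missing data; sanity rules can hide real surprises) is to label values rather than delete them. This is a minimal sketch: the sentinel codes, the valid range, and the readings are all made up, and `classify` is a hypothetical helper, not part of any catalog's tooling.

```python
# Hypothetical sentinel codes; a real catalog defines (or fails to define) its own.
SENTINELS = {-999.0, -9999.0}


def classify(value, valid_range):
    """Label a raw value as 'missing', 'suspect', or 'ok'.

    Suspect values are flagged, not dropped: an out-of-range reading
    may be an error, or it may be a real and surprising observation.
    """
    if value is None or value in SENTINELS:
        return "missing"
    low, high = valid_range
    if not (low <= value <= high):
        return "suspect"
    return "ok"


raw = [310.0, -999.0, 120.0, None, 295.0]  # made-up readings
labels = [classify(v, valid_range=(200.0, 500.0)) for v in raw]
```

Keeping the "suspect" readings in the data set, with a flag, leaves the decision about them to a person who can look at the instrument and the context.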
Permanence of Catalogs
How long does the media last?
Is the data viewed as temporary or worth archiving?
Is more data generated than can be stored?
Proprietary? Subject to privacy issues?
Backup policy?
Organization that holds the data
  Tech savvy?
  Have money?
  Well-organized in general?
  Redundant sites?
    Banks do this, but they have $ and are protecting $
The Importance of Prediction
Scientists do a lot of things:
  Prediction underlies much of it
Scientific assertions
  Numbers: equations
  Concepts: rules, decision trees, production systems
  Probabilities: graphical models
    Next time!
Equations
“Strongest” form of knowledge
Defines relationships between quantities
  F = ma
    Force and acceleration act in the same direction
    Mass is a scalar
For N variables: N-1 knowns -> 1 unknown
Scalar or vector
  F = ma
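The "N-1 knowns -> 1 unknown" point can be made concrete with F = ma, which has three variables. A minimal sketch for the scalar case; the function name `solve_newton` and its keyword arguments are hypothetical.

```python
def solve_newton(force=None, mass=None, accel=None):
    """Given exactly two of F, m, a (scalars, consistent units), return the third."""
    missing = [name for name, value in
               (("force", force), ("mass", mass), ("accel", accel))
               if value is None]
    if len(missing) != 1:
        raise ValueError("supply exactly two of force, mass, accel")
    if force is None:
        return mass * accel      # F = m * a
    if mass is None:
        return force / accel     # m = F / a
    return force / mass          # a = F / m
```

For example, `solve_newton(mass=2.0, accel=3.0)` solves for the force and `solve_newton(force=6.0, mass=2.0)` solves for the acceleration. The same equation answers three different questions depending on which quantity is left unknown.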
Equations, cont'd
Are they always generalizations?
The Drake Equation:
  N = R* × f_p × n_e × f_l × f_i × f_c × L
Where:
  N = number of civilizations emitting radio transmissions
  R* = rate of Sun-like star formation
  f_p = fraction of stars with planets
  n_e = average number of inhabitable planets
  f_l, f_i, f_c = probability of life, intelligence and civilization
  L = average duration of a civilization
So far N = 0; is this equation useful?
Do we know any of these terms well?
What if N were non-zero; would the eqn be useful?
Computing with symbols
Rules
  If dog(X) and wet(X) then smelly(X)
Decision trees
Production systems
  Like rules, but with a special memory to hold what is true
Simulations
Dimensions
  Deterministic or stochastic?
  Discrete or continuous?
  Steady-state or time-varying?
  Linear or non-linear?
  Batch or real-time?
Phases of development
  1. Real system to logical abstraction
  2. Appropriate data structures/algorithms/math technique
  3. Implement algorithm as program
  4. Validate implementation
Real System to Logical Abstraction
1. Define the problem
  What are the objectives?
  What question(s) would you like to answer?
  What do you anticipate answering in the future?
2. When in doubt, leave detail out
  You may not need it
    “A theory should be as simple as it needs to be, and no simpler” (A. Einstein)
  You can always add it in the future
3. Spend the time turning a vague statement of goals into better-defined one(s)
  Revisit as necessary
  (That's just good software engineering!)
Appropriate Datastructure/Algorithm/Math Technique
Setting up the model
  Start with what's known
  Go simple -> complex
  Iterate
  Develop large models modularly
  Model only what is necessary
  Make (and state) assumptions and hypotheses
  State constraints
  Alternate between top-down & bottom-up
Best data structure/algorithm?
  Data structures and algorithms class
Best math technique?
  Numerical methods class
Validation
Define some “test cases” beforehand
  Define (input, output) pairings
  Should range in complexity
  Should be checked:
    By scientists, or
    By well-known (e.g. “textbook”) cases, or
    By hand (and then double-checked by someone else)
Just good software engineering!
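The (input, output) pairings above can be written down as data and checked mechanically. A minimal sketch, assuming a textbook free-fall model as the implementation under test; `fall_time`, `TEST_CASES` and `validate` are hypothetical names, and the expected outputs were worked by hand from t = sqrt(2h/g) with g = 9.8 m/sec^2.

```python
import math


def fall_time(height_m, g=9.8):
    """Time in seconds for an object dropped from rest to fall height_m (no drag)."""
    return math.sqrt(2.0 * height_m / g)


# (input, expected output) pairs, ranging from trivial to less trivial.
TEST_CASES = [
    (0.0, 0.0),    # nothing to fall
    (4.9, 1.0),    # 2 * 4.9 / 9.8 = 1, sqrt(1) = 1
    (19.6, 2.0),   # 2 * 19.6 / 9.8 = 4, sqrt(4) = 2
]


def validate(func, cases, tol=1e-9):
    """Return the cases (input, expected, actual) where func disagrees."""
    return [(x, want, func(x)) for x, want in cases
            if not math.isclose(func(x), want, abs_tol=tol)]


failures = validate(fall_time, TEST_CASES)  # empty list means every case passed
```

Writing the cases before the implementation, as the slide suggests, keeps them from being tuned to whatever the code happens to produce.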
Implementation
Procedural languages
Object-oriented languages
Artificial intelligence languages
Production systems
Simulation languages
Procedural Languages
Examples
  Fortran, C
Fast
  Compilers map language to machine code easily
  Decades' worth of optimizing compilers
Large body of libraries for scientific computing
  Among the first programming languages
  Decades' worth of optimized libraries
Have to think in machine terms, not domain terms
Object Oriented Languages
Examples
  C++, Java, C#, (Eiffel, Smalltalk)
Object-oriented nature
  Classes can represent scientific objects
  Methods can represent scientific transitions
  C++ can link with C libraries
Still may be a hassle to program objects
Artificial Intelligence Languages
Examples
  Prolog, Lisp, Scheme
Prolog: predicate logic (Horn clause) based
Facts:
  sensitiveNose(sniffer).
  dog(fido).
  wet(fido).
Rules:
  smelly(X) :- dog(X), wet(X).
  offendedBy(Y,X) :- smelly(X), sensitiveNose(Y).
Possible queries:
  Is sniffer offended by fido?
    offendedBy(sniffer,fido).
  Does fido offend anyone?
    offendedBy(A,fido).
  Is sniffer offended by anyone?
    offendedBy(sniffer,B).
  Is anyone offended by anyone else?
    offendedBy(A,B).
Artificial Intelligence Langs (2)
Lisp
  It's all functions operating on lists:
  (defun computeProperties (object newObj)
    (if (and (getProperties object 'dog)
             (getProperties object 'wet))
        (addProperty object 'smelly newObj)))
Artificial Intelligence Langs (3)
Advantages
  Encourage thinking in domain space
  Prolog: modular rules
  Lisp: functions operating on lists are good for symbolic processing
Disadvantages
  Prolog: not good for floating-point numbers
    Exact matching is not compatible with floating-point rounding
  Lisp: handling loops and variables is somewhat contrived
Production Systems
Examples
  SOAR, OPS5, Mycin?
Models human thought
Production sys = rules + working memory
  Rules = if-then statements
    if dog(X) and wet(X) then smelly(X)
    if smelly(X) and sensitiveNose(Y) then offendedBy(Y,X)
  Working memory = models human memory
    dog(fido), wet(fido), sensitiveNose(sniffer)
Computation
  1st round: compute smelly(fido)
  2nd round: compute offendedBy(sniffer,fido)
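The two computation rounds above can be reproduced with a very small forward-chaining loop. This is a sketch under simplifying assumptions: rules are hand-written Python functions rather than pattern-matched productions, there is no conflict resolution and no RETE network, and the names `rule_smelly`, `rule_offended` and `run` are hypothetical.

```python
def rule_smelly(wm):
    """if dog(X) and wet(X) then smelly(X)"""
    return {("smelly", fact[1]) for fact in wm
            if fact[0] == "dog" and ("wet", fact[1]) in wm}


def rule_offended(wm):
    """if smelly(X) and sensitiveNose(Y) then offendedBy(Y, X)"""
    return {("offendedBy", y[1], x[1])
            for x in wm if x[0] == "smelly"
            for y in wm if y[0] == "sensitiveNose"}


def run(working_memory, rules):
    """Fire all rules in rounds until a round adds nothing new.

    Returns the final working memory and the new facts added in each round.
    """
    wm = set(working_memory)
    rounds = []
    while True:
        new = set()
        for rule in rules:
            new |= rule(wm) - wm   # every rule sees the memory as of round start
        if not new:
            return wm, rounds
        rounds.append(new)
        wm |= new


facts = {("dog", "fido"), ("wet", "fido"), ("sensitiveNose", "sniffer")}
memory, rounds = run(facts, [rule_smelly, rule_offended])
```

Here `rounds[0]` holds `smelly(fido)` and `rounds[1]` holds `offendedBy(sniffer, fido)`, matching the 1st and 2nd rounds on the slide.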
Production Systems, cont'd
Advantages
  Productions are inherently modular
  Can have constraints on working memory elements (“WMEs”)
    Better models human cognitive limitations
  RETE algorithm matches WMEs and rules efficiently
  Rules = laws, WMEs = state
When rules conflict, what happens?
  Architecture has to do something
    OPS5: more specific rule wins
    SOAR: create problem space to resolve issue
    MYCIN: rules w/certainty factors (weighted vote)
  Is what the architecture does what you want?
Simulation Languages
Examples
  QSIM, SPICE (analog circuits), Scilab
Qualitative simulations
  Very precise, not very general:
    d^2x/dt^2 (t) = -9.8 m/sec^2, x(t0) = 2 m, v(t0) = 0 m/sec
  Less precise, more general:
    d^2x/dt^2 (t) = -g, x(t0) = x0, v(t0) = v0
  Even less precise, even more general:
    dx/dt = M- (monotonically decreasing function), x(t0) = 0 .. x0 .. infinity
QSIM (1)
Values
  Either landmark values
    0, MAX_CAPACITY, t0, inf
  Or ranges between landmark values
    Not full but not empty: (0,MAX_CAPACITY)
    Time before t1 but after t0: (t0,t1)
Variables:
  Have both a value and a derivative
    <0,inc> <(0,MAX_CAPACITY),inc> <MAX_CAPACITY,std>
Functions:
  Either monotonically increasing or decreasing
    M+, M-
Derivatives:
  Either increasing, decreasing or steady: inc, dec, std
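The value/derivative pairs on this slide map naturally onto a small data structure. A minimal sketch, not QSIM's actual representation: landmarks are plain strings, an interval is a pair of landmarks, and the class name `QValue` is hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Landmark = str                         # e.g. "0", "MAX_CAPACITY", "inf"
Interval = Tuple[Landmark, Landmark]   # open range between two landmarks


@dataclass(frozen=True)
class QValue:
    """Qualitative value: a magnitude (landmark or interval) plus a direction."""
    magnitude: Union[Landmark, Interval]
    direction: str  # "inc", "dec" or "std"

    def __post_init__(self):
        if self.direction not in ("inc", "dec", "std"):
            raise ValueError(f"bad direction: {self.direction}")

    def at_landmark(self) -> bool:
        return isinstance(self.magnitude, str)

    def __str__(self) -> str:
        mag = self.magnitude if self.at_landmark() else "(%s,%s)" % self.magnitude
        return f"<{mag},{self.direction}>"


# The three example values from the slide: empty and filling, partly full, full.
filling = [
    QValue("0", "inc"),
    QValue(("0", "MAX_CAPACITY"), "inc"),
    QValue("MAX_CAPACITY", "std"),
]
```

Printing the list elements with `str` gives the same angle-bracket notation the slides use.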
QSIM (2): Classic Example
U-Tube: two tubes connected by a pipe
Both have
  Some initial fluid level
    amtA: 0..AMAX..inf    amtB: 0..BMAX..inf
  Pressure dependent upon level
    pressureA = M+(amtA)    pressureB = M+(amtB)
QSIM (3)
Other constraints:
  Total fluid only in A & B, and is constant:
    amtA + amtB = total
    constant(total)
  Flow depends on pressure difference:
    pAB = pressureA - pressureB
    flowAB = M+(pAB)
    d(amtB)/dt = flowAB
    d(amtA)/dt = -flowAB
  Knowledge of quantities:
    total: 0..inf
    amtA: 0..AMAX..inf    amtB: 0..BMAX..inf
    pressureA: 0..inf     pressureB: 0..inf
    pAB: -inf..0..inf
    flowAB: -inf..0..inf
  Correspondence between values
    pressureA & amtA: (0,0), (inf,inf)
    pressureB & amtB: (0,0), (inf,inf)
    flowAB & pAB: (-inf,-inf), (0,0), (inf,inf)
QSIM (4)
Note correspondence w/ordinary differential eqn
  d(amtB)/dt = f(g(amtA) - h(amtB))
  f, g, h are M+
Qualitative state is dynamic
  Imagine filling tank A (ignore tank B)
  Need to represent amtA and its derivative:
  amtA(t):
    t0: <0,inc>
    (t0,t1): <(0,AMAX),inc>
    t1: <AMAX,std>
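For comparison with the qualitative description, the ordinary differential equation d(amtB)/dt = f(g(amtA) - h(amtB)) can be integrated numerically once specific M+ functions are chosen. The sketch below assumes the simplest choice, all three linear, so that flowAB = k * (amtA - amtB); the rate constant `k`, the step size, and the starting amounts are made-up values, and overflow at BMAX is not modeled.

```python
def simulate_u_tube(amt_a, amt_b, k=1.0, dt=0.01, steps=2000):
    """Euler-integrate d(amtB)/dt = k*(amtA - amtB) and d(amtA)/dt = -d(amtB)/dt."""
    history = [(amt_a, amt_b)]
    for _ in range(steps):
        flow_ab = k * (amt_a - amt_b)  # assumed linear, hence M+, flow law
        amt_a -= flow_ab * dt          # d(amtA)/dt = -flowAB
        amt_b += flow_ab * dt          # d(amtB)/dt = flowAB
        history.append((amt_a, amt_b))
    return history


history = simulate_u_tube(amt_a=10.0, amt_b=0.0)  # tank A full, tank B empty
final_a, final_b = history[-1]
```

With these assumptions the two levels approach each other and the total stays constant. The numeric run gives one trajectory for one choice of functions and numbers, whereas the qualitative model covers every choice of M+ functions at once.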
QSIM (5): Predicting behavior
1. Give initial (at least some) conditions:
  “Tank A full”     t = t0: amtA = <AMAX,?>
  “Tank B empty”    t = t0: amtB = <0,?>
QSIM (6)
2. Correspondence propagation on init state
  amtB = 0 therefore pressureB = 0
  amtA = AMAX therefore pressureA = (0,inf)
  amtA = AMAX && amtB = 0 therefore total = (0,inf)
  pressureA = (0,inf) and pressureB = 0 therefore pAB = (0,inf)
  constant(total) therefore d(total)/dt = std
  pAB = (0,inf) therefore flowAB = (0,inf)
  flowAB = (0,inf) therefore d(amtA)/dt = dec, d(amtB)/dt = inc
  d(amtA)/dt = dec therefore d(pressureA)/dt = dec
  d(amtB)/dt = inc therefore d(pressureB)/dt = inc
QSIM (7)
(Init state correspondence propagation, cont'd)
Additional constraint
  d(pressureA)/dt = dec && d(pressureB)/dt = inc therefore d(pAB)/dt = dec
Last propagation
  d(pAB)/dt = dec therefore d(flowAB)/dt = dec
So we have at t = t0:
  amtA = <AMAX,dec>      pressureA = <(0,inf),dec>
  amtB = <0,inc>         pressureB = <0,inc>
  pAB = <(0,inf),dec>    flowAB = <(0,inf),dec>
  total = <(0,inf),std>
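The direction-of-change part of this propagation can be written as a few lines of sign arithmetic. This is a sketch of the reasoning on the slides, not QSIM's own constraint propagator: it handles only the directions (inc/std/dec) for the U-tube, takes the sign of flowAB as given, and the helper names are hypothetical.

```python
INC, STD, DEC = "inc", "std", "dec"
SIGN = {INC: 1, STD: 0, DEC: -1}
NAME = {1: INC, 0: STD, -1: DEC}


def m_plus(direction):
    """A monotonically increasing function preserves the direction of change."""
    return direction


def negate(direction):
    return NAME[-SIGN[direction]]


def q_subtract(d1, d2):
    """Direction of (x - y) from the directions of x and y; None if ambiguous."""
    if SIGN[d1] * SIGN[d2] > 0:       # both inc or both dec: cannot tell
        return None
    diff = SIGN[d1] - SIGN[d2]
    return NAME[(diff > 0) - (diff < 0)]


def directions_at_t0(flow_sign=1):
    """Directions for the U-tube given the sign of flowAB (+1 when pAB > 0)."""
    d = {}
    d["amtB"] = NAME[flow_sign]            # d(amtB)/dt = flowAB
    d["amtA"] = negate(d["amtB"])          # d(amtA)/dt = -flowAB
    d["pressureA"] = m_plus(d["amtA"])     # pressureA = M+(amtA)
    d["pressureB"] = m_plus(d["amtB"])     # pressureB = M+(amtB)
    d["pAB"] = q_subtract(d["pressureA"], d["pressureB"])
    d["flowAB"] = m_plus(d["pAB"])         # flowAB = M+(pAB)
    d["total"] = STD                       # constant(total)
    return d


state = directions_at_t0()
```

With a positive flow, `state` reproduces the directions listed for t = t0. `q_subtract` returning `None` marks the cases where sign information alone cannot decide, which is where qualitative simulation has to branch.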
QSIM (8)
3. Predicting the next state:
What events can make the next state?
  1. A variable can reach a limit or landmark
  2. A variable may move off a landmark
  3. A variable may start or stop moving
6 variables moving to limits/landmarks
  (theoretically): 4^6 - 1 = 4095 possibilities
  (actual): constraints & correspondences reduce choices
t = (t0,t1):
  amtA = <(0,AMAX),dec>    pressureA = <(0,inf),dec>
  amtB = <(0,BMAX),inc>    pressureB = <(0,inf),inc>
  pAB = <(0,inf),dec>      flowAB = <(0,inf),dec>
  total = <(0,inf),std>
QSIM (9)
What happens after?
t = (t0,t1):
  amtA = <(0,AMAX),dec>    pressureA = <(0,inf),dec>
  amtB = <(0,BMAX),inc>    pressureB = <(0,inf),inc>
  pAB = <(0,inf),dec>      flowAB = <(0,inf),dec>
  total = <(0,inf),std>
Well:
  amtA is decreasing toward 0 (won't get there)
  amtB is increasing toward BMAX
  flowAB is decreasing toward 0
So either:
  1. flowAB gets to 0 AND amtB gets to BMAX, or
  2. flowAB gets to 0 BEFORE amtB gets to BMAX, or
  3. amtB gets to BMAX BEFORE flowAB gets to 0 (Tank B overflows)
QSIM (10)
1. flowAB gets to 0 AND amtB gets to BMAX:
t = t1(a):
  amtA = <(0,AMAX),std>    pressureA = <(0,inf),std>
  amtB = <BMAX,std>        pressureB = <(0,inf),std>
  pAB = <0,std>            flowAB = <0,std>
  total = <(0,inf),std>
3. amtB gets to BMAX BEFORE flowAB gets to 0 (Tank B overflows)
t = t1(c):
  amtA = <(0,AMAX),dec>    pressureA = <(0,inf),dec>
  amtB = <BMAX,inc>        pressureB = <(0,inf),inc>
  pAB = <(0,inf),dec>      flowAB = <(0,inf),dec>
  total = <(0,inf),std>
QSIM (11)
2. flowAB gets to 0 BEFORE amtB gets to BMAX
t = t1(b):
  amtA = <(0,AMAX),std>    pressureA = <(0,inf),std>
  amtB = <(0,BMAX),std>    pressureB = <(0,inf),std>
  pAB = <0,std>            flowAB = <0,std>
  total = <(0,inf),std>
amtB's level becomes a new landmark, so:
  amtA: 0..a0..AMAX..inf
  pressureA: 0..p0..inf
  amtB: 0..a1..BMAX..inf
  pressureB: 0..p1..inf
  pAB: -inf..0..inf
  flowAB: -inf..0..inf
  total: 0..to0..inf
QSIM (12)
Output: graph of state transitions [figure not included in transcript]
QSIM (13)
QSIM algorithm:
  1. Init queue w/QState(t0), complete its description
  2. If queue is empty or resource limit is exceeded then stop, otherwise pop state S from queue
  3. For each var vi in S, get successors from table
  4. Determine all successor states consistent w/restrictions
     If S is a time-interval, delete copies of itself
  5. For each successor Si, assert:
     successor(S,Si)
     predecessor(Si,S)
  6. Apply global filters on successors
  7. Add successors to queue
     Don't add inconsistent, quiescent, cycle, transition or t = inf states
  8. Go to 2
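The control loop of the algorithm above is a queue-driven graph expansion. The sketch below shows only that loop: the per-variable successor table, the constraint checks, and the global filters are passed in as caller-supplied functions (`successors`, `is_consistent`, `is_final`, all hypothetical names), and the "delete copies of itself" step is simplified to dropping any successor equal to the current state.

```python
from collections import deque


def qsim_skeleton(initial_state, successors, is_consistent, is_final,
                  max_states=1000):
    """Breadth-first expansion of qualitative states.

    successors(state)    -> iterable of candidate next states   (steps 3-4)
    is_consistent(state) -> bool, stands in for the filters     (step 6)
    is_final(state)      -> bool, e.g. quiescent or t = inf     (step 7)
    Returns a dict mapping each expanded state to its accepted successors.
    """
    queue = deque([initial_state])                 # step 1
    seen = {initial_state}
    graph = {}
    while queue and len(graph) < max_states:       # step 2
        state = queue.popleft()
        accepted = [s for s in successors(state)
                    if s != state and is_consistent(s)]
        graph[state] = accepted                    # step 5: successor links
        for s in accepted:                         # step 7
            if s not in seen and not is_final(s):
                seen.add(s)
                queue.append(s)
    return graph                                   # loop back is step 8


# Toy successor table for the U-tube example, with states named by their times.
TOY = {"t0": ["(t0,t1)"],
       "(t0,t1)": ["(t0,t1)", "t1(a)", "t1(b)", "t1(c)"]}
graph = qsim_skeleton("t0", lambda s: TOY.get(s, []), lambda s: True,
                      is_final=lambda s: s.startswith("t1"))
```

The resulting `graph` is the state-transition graph for the toy table: t0 leads to (t0,t1), which branches into the three outcomes t1(a), t1(b) and t1(c).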
Next Time
Probabilistic reasoning
Graphical models
Machine learning and CSD