Big Data and IoT
![Page 1: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/1.jpg)
Big Data Mining and Internet of Things
Presented By-
Shubham Singh(40004796)
Shubhangi Sheel(40004793)
![Page 2: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/2.jpg)
![Page 3: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/3.jpg)
![Page 4: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/4.jpg)
![Page 5: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/5.jpg)
![Page 6: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/6.jpg)
![Page 7: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/7.jpg)
![Page 8: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/8.jpg)
![Page 9: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/9.jpg)
Problems
Paper 1: Data Mining with Big Data
Modeling big data characteristics (HACE Theorem)
Identify key challenges for big data mining
Paper 2: IOT-StatisticDB: A General Statistical Database Cluster Mechanism for Big Data Analysis in the Internet of Things
Sensor sampling data is huge and heterogeneous, with widely differing formats and semantics
No in-kernel database statistical analysis techniques are available for IoT data
Most existing statistical analysis methods are centralized solutions, unsuited for IoT
![Page 10: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/10.jpg)
What kind of data are we talking about?
A Google search for “Yan Mo Nobel Prize” returned 1,050,000 web pointers
News media
Comments on social network
Cross-referenced discussions by critics
The Square Kilometer Array (SKA) in radio astronomy consists of 1,000 to 1,500 15-meter dishes in a central 5-km area in South Africa and Australia
It provides vision 100 times more sensitive than any existing radio telescope
It generates a data volume of 40 gigabytes (GB) per second
Existing methods can only work offline and are incapable of handling this Big Data scenario in real time
![Page 11: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/11.jpg)
BIG DATA CHARACTERISTICS: HACE THEOREM
H: Heterogeneous
A: Autonomous Sources
C: Complex Data
E: Evolving Relationships
![Page 12: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/12.jpg)
‘H’ for Heterogeneity
Heterogeneous and diverse dimensionalities
Different schemata and protocols
Example: An individual is represented by
Demographic Information: Text (gender, age, family disease history, etc.)
X-ray Examination: Image
CT Scan: Image/video
DNA or genomic related test: Image (microarray expression images and sequences)
![Page 13: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/13.jpg)
‘A’ for Autonomous sources with distributed and decentralized Control
Autonomous data sources with distributed and decentralized controls
Example: World Wide Web (WWW): Each web server provides a certain amount of information and can function fully independently
Google, Flickr, Facebook, Walmart: Each has a large number of server farms deployed all over the world
Local legislation differs
Seasonal promotions
Top selling items
Customer behavior
![Page 14: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/14.jpg)
‘C’ for Complex Data and ‘E’ for Evolving Relationships
In centralized information systems, the focus is on finding best feature values to represent each observation
Example: Facebook or Twitter
An individual is represented by features, but social connections, the most important factor in human society, are not taken into account
In a dynamic world, the features evolve with respect to temporal, spatial, and other factors.
![Page 15: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/15.jpg)
[Figure: example data shapes. Panels: clustered data; linear regression; central core with 3 flares; loopy behavior]
![Page 16: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/16.jpg)
DATA MINING CHALLENGES WITH BIG DATA
![Page 17: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/17.jpg)
DATA MINING CHALLENGES WITH BIG DATA
Tier III: Big Data Mining Algorithms
Tier II: Big Data Semantics and Application Knowledge
Tier I: Big Data Mining Platform
![Page 18: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/18.jpg)
Tier I: Big Data Mining Platform
A computing platform requires two resources: hard disks and processors
Big data is distributed, so parallel computing and collective mining are used
Frameworks rely on computer clusters with a high-performance computing platform, programmed with tools such as MapReduce or Enterprise Control Language (ECL)
Example: The Titan supercomputer, deployed at Oak Ridge National Laboratory in Tennessee, contains 18,688 nodes, each with a 16-core CPU.
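The MapReduce style named on this slide can be illustrated with a minimal single-machine sketch. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sensor readings are hypothetical; a real framework would run the map and reduce phases on different cluster nodes.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record: here (sensor type, reading)."""
    sensor_type, reading = record
    yield sensor_type, reading

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values of one key into a single result: here the maximum."""
    return key, max(values)

# Hypothetical input records
records = [("temperature", 27.5), ("wind", 62.5), ("temperature", 30.1)]
pairs = [pair for record in records for pair in map_phase(record)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)
```

Because each key is reduced independently, the reduce calls can run in parallel, which is what makes the pattern suitable for cluster platforms.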
![Page 19: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/19.jpg)
Elephant in the room
![Page 20: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/20.jpg)
Data Privacy
![Page 21: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/21.jpg)
Tier II: Big Data Semantics and Application Knowledge
Information Sharing and Data Privacy
Restrict access to the data
Anonymize data fields
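As a rough illustration of anonymizing data fields, the sketch below pseudonymizes a direct identifier with a salted hash and generalizes an exact age into a ten-year range. The record layout, the field names, and the salt are assumptions made for this example, and salted hashing alone is pseudonymization rather than a full privacy guarantee.

```python
import hashlib

def anonymize(record, salt):
    """Return a copy with the name pseudonymized and the age generalized."""
    digest = hashlib.sha256((salt + record["name"]).encode("utf-8")).hexdigest()
    decade = (record["age"] // 10) * 10
    return {
        "pid": digest[:12],                       # pseudonym replaces the name
        "age_range": f"{decade}-{decade + 9}",    # generalization of the exact age
        "diagnosis": record["diagnosis"],         # field kept for mining
    }

# Hypothetical patient record
patient = {"name": "Alice Example", "age": 47, "diagnosis": "Type II diabetes"}
safe = anonymize(patient, salt="demo-salt")
print(safe)
```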
![Page 22: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/22.jpg)
Tier II: Big Data Semantics and Application Knowledge
Domain and Application knowledge
Identify the right features for modeling the underlying data
Example: Blood glucose level is clearly a better feature than body mass in diagnosing Type II diabetes
![Page 23: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/23.jpg)
Tier III: Big Data Mining Algorithms
Local Learning and Model Fusion for Multiple Information Sources
Mining distributed data often leads to a biased view of the data, resulting in biased decisions or models
To overcome this, information exchange and fusion mechanisms are needed so that local mining and global correlation together achieve a global optimization goal
![Page 24: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/24.jpg)
Mining from Sparse, Uncertain, and Incomplete Data
Sparse, uncertain, and incomplete data are defining features for Big Data applications.
Sparse data
The number of data points is too small to draw reliable conclusions
Uncertain data
Each data field is no longer deterministic but is subject to a random/error distribution
A data item is represented as a sample distribution rather than a single value, so most existing data mining algorithms cannot be applied directly
Incomplete data
Incomplete data refers to missing data field values for some samples
Data imputation is an established research field that seeks to impute missing values to produce improved models
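One of the simplest imputation strategies is to replace a missing field with the mean of the observed values. The sketch below assumes missing values are stored as `None`; the field name and the readings are made up for illustration, and real imputation methods are usually more elaborate.

```python
from statistics import mean

def impute_mean(rows, field):
    """Return new rows in which missing values of `field` are set to the observed mean."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = mean(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]

# Hypothetical samples with one missing value
rows = [{"glucose": 5.0}, {"glucose": None}, {"glucose": 7.0}]
completed = impute_mean(rows, "glucose")
print(completed)
```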
![Page 25: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/25.jpg)
Conclusion
The HACE theorem suggests that the key characteristics of Big Data are:
Huge with heterogeneous and diverse data sources,
Autonomous with distributed and decentralized control,
Complex and Evolving in data and knowledge
Analyzed several challenges at the data, model and system levels
Analyzed challenges in Data mining:
Information Sharing and Data Privacy
Domain and Application knowledge
Data Mining Algorithms
![Page 26: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/26.jpg)
![Page 27: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/27.jpg)
Paper 2: IOT-StatisticDB: A General Statistical Database Cluster Mechanism for Big Data Analysis in the Internet of Things
This paper discusses:
A generalized schema to store different sensor data
Distributed architecture for parallel computing for IoT
Statistical analysis techniques and relevant operators
![Page 28: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/28.jpg)
Architecture of IOT-StatisticDB
![Page 29: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/29.jpg)
IoT Generalized Schema
SensorID(String)
SensorType(String)
DeployedBy(String)
DeployedTime(Instant)
Samplings(SamplingSequence)
Samplings
![Page 30: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/30.jpg)
Definitions
1. Traffic Network: $Net = (E, N)$
I. $E$ is a set of edges, each of the form $e = (eid, geo, len, nid_s, nid_e)$
II. $N$ is a set of nodes, each of the form $n = (nid, loc, (eid_i)_{i=1}^{m}, mat)$
III. $Net = (E, N)$
Node Region/ Service Area
![Page 31: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/31.jpg)
IOT table and Data Distribution at IoT-Storage and Statistics Layer
![Page 32: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/32.jpg)
2. SamplingValue = (t, loc, npos, schema, value)
* Note: Sampling value can be considered as a data type which defines the type of data from the sensors
3. SamplingComponent = (cSchema, cValue)
e.g. (“speed: real”, 62.5) or (“direction: real”, 22)
4. SamplingSequence = $(schema, ((t_i, loc_i, npos_i, value_i, flag_i))_{i=1}^{n})$
| Type of Sensor | Time (t) | Location (loc) | Network position (npos) | Schema | Value |
|---|---|---|---|---|---|
| Temperature | t1 | 39.5, 145.2 | null | “temperature: real” | 27.5 |
| GPS | t2 | 39.3, 144.3 | e201 | “speed: real, direction: real” | (62.5, 22) |
| Wind | t3 | 38.2, 142.8 | null | “windspeed: real, winddir: real” | (62.5, 22) |
| Vitalized value from Traffic Video Camera | t4 | 39.7, 142.1 | e202 | “averageSpeed: real, jam: bool” | (62.5, true) |
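The definitions above can be mirrored in a small data model. This is only a sketch under assumptions: the class and attribute names are invented here, times are plain numbers, the per-sample flag is omitted, and the sampling sequence is simplified to a list of sampling values instead of one shared schema plus tuples.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

@dataclass
class SamplingValue:
    t: float                     # sampling time
    loc: Tuple[float, float]     # location
    npos: Optional[str]          # network position, None when not on the network
    schema: str                  # e.g. "speed: real, direction: real"
    value: Tuple[Any, ...]       # one entry per schema component

    def component(self, i: int) -> Tuple[str, Any]:
        """Return the i-th (1-based) component as (cSchema, cValue)."""
        names = [part.strip() for part in self.schema.split(",")]
        return names[i - 1], self.value[i - 1]

@dataclass
class IoTRecord:
    sensor_id: str
    sensor_type: str
    deployed_by: str
    deployed_time: float
    samplings: List[SamplingValue] = field(default_factory=list)

# The GPS row of the table, with a hypothetical numeric time standing in for t2
gps = SamplingValue(t=2.0, loc=(39.3, 144.3), npos="e201",
                    schema="speed: real, direction: real", value=(62.5, 22))
record = IoTRecord("s-001", "VehicleGPS", "demo", 0.0, [gps])
print(gps.component(1))  # ('speed: real', 62.5)
```

Keeping the schema as a string next to the value is what lets one table hold readings from very different sensor types.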
![Page 33: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/33.jpg)
Query Operators for Data Retrieval and for Statistical Analysis
*Format: FunctionName (Input Parameters) -> Output
Truncation Operators:
1. truncateGeo (SamplingSequence * Region) -> SamplingSequence
2. truncateTime (SamplingSequence * Periods) -> SamplingSequence
3. atInstant (SamplingSequence * Instant) -> SamplingValue
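The three truncation operators can be sketched over a simplified sampling sequence. Assumptions in this sketch: a sample is a plain `(t, (x, y), value)` tuple, a region is an axis-aligned box, a period is a single closed interval, and `at_instant` returns the most recent sample at or before `t`; the paper may define these differently (for example with interpolation).

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, Tuple[float, float], float]  # (t, (x, y), value)

def truncate_time(seq: List[Sample], start: float, end: float) -> List[Sample]:
    """Keep only the samples whose time lies in [start, end]."""
    return [s for s in seq if start <= s[0] <= end]

def truncate_geo(seq: List[Sample], region: Tuple[float, float, float, float]) -> List[Sample]:
    """Keep only the samples located inside the box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = region
    return [s for s in seq if xmin <= s[1][0] <= xmax and ymin <= s[1][1] <= ymax]

def at_instant(seq: List[Sample], t: float) -> Optional[Sample]:
    """Return the latest sample taken at or before t, or None if there is none."""
    earlier = [s for s in seq if s[0] <= t]
    return max(earlier, key=lambda s: s[0]) if earlier else None

# Hypothetical sequence
seq = [(1.0, (39.5, 145.2), 27.5), (2.0, (39.3, 144.3), 28.0), (3.0, (38.2, 142.8), 29.5)]
print(truncate_time(seq, 1.5, 3.0))
print(truncate_geo(seq, (39.0, 144.0, 40.0, 146.0)))
print(at_instant(seq, 2.5))
```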
![Page 34: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/34.jpg)
Projection Operators:
Sampling-Sequence-Based Projections
sProjectLines: SamplingSequence -> Lines // for moving sensors
sProjectPoint: SamplingSequence -> Point // for static sensors
sProjectNetPos: SamplingSequence -> Set(String)
sProjectTime: SamplingSequence -> Periods
Sampling-Value-Based Projections
vProjectPoint: SamplingValue -> Point
vProjectNetPos: SamplingValue -> String
vProjectTime: SamplingValue -> Instant
Component Extraction Operator:
getComponent: SamplingValue * integer -> SamplingComponent
Statistical Analysis Operators
spatialAggrEU: String * String -> Region
spatialAggrNet: String * String -> Lines
parameterAggrEU: String * String -> Real
parameterAggrNet: String * String -> Set(String * String)
![Page 35: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/35.jpg)
Euclidean-Based Spatial Aggregation
Q1: Find the areas in BeijingGeo where the pollution level is above 450 at time t.
Qdata = "SELECT sProjectPoint(Samplings) FROM IoTData
WHERE SensorType = 'PollutionSensor'
AND inside(sProjectPoint(Samplings), BeijingGeo)
AND getComponent(atInstant(Samplings, t), 1) > 450";
SELECT spatialAggrEU(Qdata, DBScan(distance1, number1))
Algorithm:
INPUT: Qdata: String; // Statistical raw data collection query
cMethodPara: String;
// Clustering method and its parameters;
OUTPUT: R: Region;
1. queryRegion = GetQueryRange(Qdata);
2. Nodes = {node | area(node) ∩ queryRegion ≠ Ø}
3. FOR node ∈ Nodes DO IN PARALLEL
4. StatisticalRawData = Execute(Qdata);
5. R(node) = clusterContour(StatisticalRawData, cMethodPara);
6. SendMaster(R(node));
7. ENDFOR;
8. Results = {R(node) | node ∈ Nodes};
9. R = regionMerge(Results);
10. Return (R).
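The scatter and gather structure of this algorithm can be sketched as follows. Everything concrete here is an assumption: node service areas and the query region are axis-aligned boxes, the node data is invented, threads stand in for node servers, and the per-node clustering contour is simplified to the list of qualifying points, so the merge is a plain union instead of a region merge.

```python
from concurrent.futures import ThreadPoolExecutor

def intersects(a, b):
    """True if two axis-aligned boxes (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical node servers: service area plus locally stored readings (x, y, level)
NODES = {
    "node-1": {"area": (0, 0, 10, 10), "data": [(2, 3, 480), (4, 4, 300)]},
    "node-2": {"area": (10, 0, 20, 10), "data": [(12, 5, 470), (15, 5, 500)]},
    "node-3": {"area": (20, 0, 30, 10), "data": [(25, 5, 490)]},
}

def run_on_node(name, query_region, threshold):
    """Steps 4 to 6: evaluate the raw-data query locally and return a partial result."""
    xmin, ymin, xmax, ymax = query_region
    return [(x, y) for x, y, level in NODES[name]["data"]
            if level > threshold and xmin <= x <= xmax and ymin <= y <= ymax]

def spatial_aggregate(query_region, threshold):
    # Step 2: only nodes whose service area intersects the query region take part
    selected = [n for n, info in NODES.items() if intersects(info["area"], query_region)]
    # Step 3: run the per-node work in parallel
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda n: run_on_node(n, query_region, threshold), selected))
    # Step 9: merge the partial results on the master
    merged = sorted(point for part in partials for point in part)
    return selected, merged

selected, region = spatial_aggregate((0, 0, 18, 10), 450)
print(selected, region)
```

Only the two nodes whose areas overlap the query region do any work, which is the point of step 2.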
![Page 36: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/36.jpg)
Network-Based Spatial Aggregation
Q2: Find the blocked edge sections (with vehicle speed lower than 5 km/h) at time t in the traffic network of the Beijing area.
Qdata = "SELECT atInstant(Samplings, t) FROM IoTData
WHERE SensorType = 'VehicleGPS' AND inside(sProjectPoint(atInstant(Samplings, t)), BeijingGeo)
AND getComponent(atInstant(Samplings, t), 1) < 5";
SELECT spatialAggrNet(Qdata, DBScanNet(distance1, number1))
Algorithm:
INPUT: Qdata: String; // Raw data collection query
cMethodPara: String; // Clustering method and parameters
trafficNet: Net; // The traffic network
OUTPUT: R: Lines;
1. queryRegion = GetQueryRange(Qdata);
2. Nodes = {node | area(node) ∩ queryRegion ≠ Ø}
3. FOR node ∈ Nodes DO IN PARALLEL
4. StatisticalRawData = Execute(Qdata);
5. R(node) = netClusterLines(StatisticalRawData, trafficNet, cMethodPara);
6. SendMaster(R(node));
7. ENDFOR;
8. Results = {R(node) | node ∈ Nodes};
9. R = linesMerge(Results);
10. Return (R).
![Page 37: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/37.jpg)
Euclidean-based Parameter Aggregation
Q3: Find the average pollution level at time t in BeijingGeo.
Qdata = "SELECT getComponent(atInstant(Samplings, t), 1)
FROM IoTData
WHERE SensorType = 'PollutionSensor'
AND inside(sProjectPoint(Samplings), BeijingGeo)";
SELECT parameterAggrEU(Qdata, Average)
Algorithm:
INPUT: Qdata: String; //Raw data collection query
method: String; //aggregation method
OUTPUT: R: Real;
1. queryRegion = GetQueryRange(Qdata);
2. Nodes = {node | area(node) ∩ queryRegion ≠ Ø}
3. FOR node ∈ Nodes DO IN PARALLEL
4. StatisticalRawData = Execute(Qdata);
5. R(node) = aggregate(StatisticalRawData, method);
6. N(node) = |StatisticalRawData|;
7. SendMaster(R(node), N(node));
8. ENDFOR;
9. Results = {(R(node), N(node)) | node ∈ Nodes};
10. R = valueMerge(Results, method);
11. Return (R).
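Each node returns its sample count N(node) together with its local result so that the master can weight the partial averages correctly: for the Average method, R = Σ N(node)·R(node) / Σ N(node). The sketch below assumes the Average method, non-empty node data, and invented readings; the function names only loosely follow the pseudocode.

```python
def aggregate(values):
    """Per node: return the local average and the number of samples behind it."""
    return sum(values) / len(values), len(values)

def value_merge(results):
    """On the master: combine (average, count) pairs into the global average."""
    total = sum(n for _, n in results)
    return sum(avg * n for avg, n in results) / total

# Hypothetical pollution readings held by two node servers
node_a = [400.0, 500.0]   # local average 450.0 over 2 samples
node_b = [300.0]          # local average 300.0 over 1 sample

results = [aggregate(node_a), aggregate(node_b)]
merged = value_merge(results)                           # 400.0, same as averaging all readings
naive = sum(avg for avg, _ in results) / len(results)   # 375.0, ignores the counts
print(merged, naive)
```

Averaging the local averages without the counts gives 375.0 here, while the true average over all three readings is 400.0.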
![Page 38: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/38.jpg)
Network-based Parameter Aggregation
Q4: Find the traffic flow parameters at time t for each edge in BeijingGeo.
Qdata = "SELECT sTruncateTime(sTruncateGeo(Samplings, BeijingGeo), [t - 5*Minute, t])
FROM IoTData WHERE SensorType = 'VehicleGPS'"
SELECT parameterAggrNet(Qdata, TrajectoryAnalysis);
Algorithm:
INPUT: Qdata:String; //Raw data collection query
method: String; //aggregation method
OUTPUT: R; //of the form Set((edgeID:string, para: string))
1. queryRegion = GetQueryRange(Qdata);
2. Nodes = {node | area(node) ∩ queryRegion ≠ Ø}
3. FOR node ∈ Nodes DO IN PARALLEL
4. StatisticalRawData = Execute(Qdata);
5. R(node) = trafficAnalysis(StatisticalRawData, method);
6. SendMaster(R(node));
7. ENDFOR;
8. Results = {R(node) | node ∈ Nodes};
9. R = edgeBasedValueMerge(Results);
10. Return (R).
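An edge-based merge can be sketched by letting every node report per-edge partial sums, which the master then combines edge by edge. This is an illustration under assumptions: the traffic flow parameter is taken to be the average speed per edge, whereas the slide describes it only as a string produced by trajectory analysis, and the sample data is invented.

```python
from collections import defaultdict

def traffic_analysis(samples):
    """Per node: accumulate (speed sum, sample count) for each edge ID."""
    stats = defaultdict(lambda: [0.0, 0])
    for edge_id, speed in samples:
        stats[edge_id][0] += speed
        stats[edge_id][1] += 1
    return {edge: tuple(pair) for edge, pair in stats.items()}

def edge_based_value_merge(results):
    """On the master: add up the partial sums per edge and return average speeds."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in results:
        for edge_id, (speed_sum, count) in partial.items():
            totals[edge_id][0] += speed_sum
            totals[edge_id][1] += count
    return {edge: speed_sum / count for edge, (speed_sum, count) in totals.items()}

# Hypothetical (edge ID, speed in km/h) samples; edge e202 is seen by both nodes
node_a = traffic_analysis([("e201", 60.0), ("e201", 40.0), ("e202", 4.0)])
node_b = traffic_analysis([("e202", 2.0)])
flow = edge_based_value_merge([node_a, node_b])
print(flow)
```

An edge that crosses a node boundary, such as e202 here, is reported by more than one node, which is why the merge is keyed by edge ID.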
![Page 39: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/39.jpg)
Experimental Studies
The prototype system contained one master server and 2 to 32 node servers.
The real GPS trajectory data was collected from 20,000 taxi cabs in Beijing, with an average GPS sampling interval of 30 seconds.
The sampling sequence data of 200,000 static sensors was generated through simulation, with an average sampling interval of 5 minutes.
Compared with Centralized Statistical Analysis with Data Source Distributed (CSA-DSD), which stores sensor sampling data in a distributed manner among multiple node servers but uses one master server to do all the statistical analysis
The 4 queries above were run on both IOT-StatisticDB and CSA-DSD, and the query response time was compared against the number of nodes and the number of sensors.
![Page 40: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/40.jpg)
Query response time vs. number of nodes
![Page 41: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/41.jpg)
Query response time vs. no. of sensors
![Page 42: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/42.jpg)
Conclusions
A generalized schema to store different sensor data was proposed
An architecture was proposed to store data in a distributed manner and to compute in parallel in real time
Statistical analysis operators were defined
Algorithms for statistical analysis of IoT data were proposed.
Experimental results were compared with those of a similar framework.
![Page 43: Big Data and IOT](https://reader034.vdocuments.pub/reader034/viewer/2022042619/587137701a28abf0568b60a7/html5/thumbnails/43.jpg)